[jira] [Updated] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

Frank Rosner (JIRA) Thu, 22 Oct 2015 01:59:11 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Frank Rosner updated SPARK-11258:
---------------------------------
    Description: 
h4. Introduction

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run time 
complexity (it goes through the complete data set _m_ times, where _m_ is the 
number of columns.

h4. Problem

The {{dfToCols}} method is transposing the row-wise representation of the Spark 
DataFrame (array of rows) into a column wise representation (array of columns) 
to then be put into a data frame. This is done in a very inefficient way, 
yielding to huge performance (and possibly also memory) problems when 
collecting bigger data frames.

h4. Solution

Directly transpose the row wise representation to the column wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
method takes average 2267 ms to complete. My implementation takes only 554 ms 
on average. This effect gets even bigger, the more columns you have.

  was:
h4. Introduction

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run time 
complexity (it goes through the complete data set _m_ times, where _m_ is the 
number of columns.

h4. Problem

The {{dfToCols}} method is transposing the row-wise representation of the Spark 
DataFrame (array of rows) into a column wise representation (array of columns) 
to then be put into a data frame. This is done in a very inefficient way, 
yielding to huge performance (and possibly also memory) problems when 
collecting bigger data frames.

h4. Solution

Directly transpose the row wise representation to the column wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old `dfToCols` 
method takes average 2267 ms to complete. My implementation takes only 554 ms 
on average. This effect gets even bigger, the more columns you have.


> Remove quadratic runtime complexity for converting a Spark DataFrame into an 
> R data.frame
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-11258
>                 URL: https://issues.apache.org/jira/browse/SPARK-11258
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 1.5.1
>            Reporter: Frank Rosner
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run 
> time complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns.
> h4. Problem
> The {{dfToCols}} method is transposing the row-wise representation of the 
> Spark DataFrame (array of rows) into a column wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, yielding to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row wise representation to the column wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes average 2267 ms to complete. My implementation takes only 554 ms 
> on average. This effect gets even bigger, the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

Reply via email to