[ 
https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-22105:
-------------------------------
    Description: 
Suppose we have a DataFrame with many columns (e.g., 100 columns), each of
DoubleType, and we need to compute the average of each column. It turns out
that DataFrame avg is much slower than RDD.aggregate.

I observed this issue in this PR (one-pass Imputer):
https://github.com/apache/spark/pull/18902

I also wrote a minimal test that reproduces the issue, using `sum` as the
aggregate:
https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
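For reference, the comparison looks roughly like the sketch below. This is illustrative, not the exact code in the branch above; the object name, column/row counts, and the timing helper are made up for the example:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object ManyColumnAggBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ManyColumnAggBenchmark")
      .getOrCreate()
    import spark.implicits._

    val numCols = 100
    val numRows = 1000000

    // Build and cache a DataFrame with `numCols` DoubleType columns.
    val df = spark.range(numRows).select(
      (0 until numCols).map(i => ($"id" + i).cast("double").as(s"c$i")): _*
    ).cache()
    df.count() // materialize the cache before timing

    def time[T](label: String)(f: => T): T = {
      val start = System.nanoTime()
      val result = f
      println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    // DataFrame path: one generated aggregate covering all 100 columns.
    time("DataFrame sum") {
      df.select(df.columns.map(c => sum(c)): _*).collect()
    }

    // RDD path: a hand-written fold over Array[Double] accumulators.
    val rdd = df.rdd.map(row => Array.tabulate(numCols)(row.getDouble))
    time("RDD.aggregate sum") {
      rdd.aggregate(new Array[Double](numCols))(
        (acc, a) => { var i = 0; while (i < numCols) { acc(i) += a(i); i += 1 }; acc },
        (a, b) => { var i = 0; while (i < numCols) { a(i) += b(i); i += 1 }; a }
      )
    }

    spark.stop()
  }
}
{code}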

When computing `sum` over 100 `DoubleType` columns, DataFrame aggregation is
about 3x slower than `RDD.aggregate`; but when computing only a single column,
DataFrame aggregation is much faster than `RDD.aggregate`.

The cause appears to be a defect in DataFrame codegen: codegen inlines
everything and generates one large block of code. When the number of columns is
large (e.g., 100 columns), the generated method becomes so big that the JVM
fails to JIT-compile it and falls back to interpreting the bytecode.
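One way to confirm this is to dump the generated code and look at the method sizes, using `df` from the sketch above and Spark's codegen debug helper:

{code:scala}
import org.apache.spark.sql.execution.debug._

// Prints the Java source produced by whole-stage codegen for this plan.
// With 100 columns the aggregate method bodies get very large; by default
// the JVM JIT skips methods above HugeMethodLimit (8000 bytecodes, see
// -XX:-DontCompileHugeMethods), so the stage runs interpreted.
df.select(df.columns.map(c => sum(c)): _*).debugCodegen()
{code}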
This PR should address the issue:
https://github.com/apache/spark/pull/19082
But after that PR is merged we need more performance testing against some of
the ML code, to check whether this issue is actually fixed.

This JIRA is used to track this performance issue.


> Dataframe has poor performance when computing on many columns with codegen
> --------------------------------------------------------------------------
>
>                 Key: SPARK-22105
>                 URL: https://issues.apache.org/jira/browse/SPARK-22105
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 2.3.0
>            Reporter: Weichen Xu
>            Priority: Minor
>


