[jira] [Resolved] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen

Hyukjin Kwon (Jira) Mon, 07 Oct 2019 22:45:58 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-22105.
----------------------------------
    Resolution: Incomplete

> Dataframe has poor performance when computing on many columns with codegen
> --------------------------------------------------------------------------
>
>                 Key: SPARK-22105
>                 URL: https://issues.apache.org/jira/browse/SPARK-22105
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 2.3.0
>            Reporter: Weichen Xu
>            Priority: Minor
>              Labels: bulk-closed
>
> Suppose we have a dataframe with many columns (e.g 100 columns), each column 
> is DoubleType.
> And we need to compute avg on each column. We will find using dataframe avg 
> will be much slower than using RDD.aggregate.
> I observe this issue from this PR: (One pass imputer)
> https://github.com/apache/spark/pull/18902
> I also write a minimal testing code to reproduce this issue, I use computing 
> sum to reproduce this issue:
> https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
> When we compute `sum` on 100 `DoubleType` columns, dataframe avg will be 
> about 3x slower than `RDD.aggregate`, but if we only compute one column, 
> dataframe avg will be much faster than `RDD.aggregate`.
> The reason of this issue, should be the defact in dataframe codegen. Codegen 
> will inline everything and generate large code block. When the column number 
> is large (e.g 100 columns), the codegen size will be too large, which cause 
> jvm failed to JIT and fall back to byte code interpretation.
> This PR should address this issue:
> https://github.com/apache/spark/pull/19082
> But we need more performance test against some code in ML after above PR 
> merged, to check whether this issue is actually fixed.
> This JIRA used to track this performance issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Resolved] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen

Reply via email to