[
https://issues.apache.org/jira/browse/SPARK-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267159#comment-15267159
]
Yin Huai commented on SPARK-13749:
----------------------------------
https://github.com/apache/spark/pull/11583 has been merged. I will resolve this
JIRA once the follow-up PR is in.
> Faster pivot implementation for many distinct values with two phase
> aggregation
> -------------------------------------------------------------------------------
>
> Key: SPARK-13749
> URL: https://issues.apache.org/jira/browse/SPARK-13749
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Andrew Ray
>
> The existing implementation of pivot translates into a single aggregation
> with one aggregate per distinct pivot value. When the number of distinct
> pivot values is large (say 1000+) this can get extremely slow, since each
> input row is evaluated against every aggregate even though it contributes to
> only one of them.
> I'm proposing an alternate strategy for when there are 10+ (somewhat
> arbitrary threshold) distinct pivot values. We do two phases of aggregation.
> In the first, we group by the grouping columns plus the pivot column and
> perform the specified aggregations (one or sometimes more). In the second
> aggregation, we group by the grouping columns only and use the new
> (non-public) PivotFirst aggregate, which rearranges the outputs of the first
> aggregation into an array indexed by the pivot value. Finally, we do a
> projection to extract the array entries into the appropriate output columns.
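The two-phase strategy described above can be sketched outside Spark with plain Python. This is only an illustration of the idea, not Spark's actual PivotFirst implementation; the toy rows, the `sum` aggregate, and all variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (grouping col, pivot col, measure) rows.
rows = [
    ("a", "x", 1),
    ("a", "y", 2),
    ("b", "x", 3),
    ("a", "x", 4),
]
pivot_values = ["x", "y"]  # distinct pivot values, in output-column order

# Phase 1: group by (grouping cols + pivot col) and run the specified
# aggregation (here: sum). Each input row touches exactly one group.
phase1 = defaultdict(int)
for group, pivot, value in rows:
    phase1[(group, pivot)] += value

# Phase 2: group by the grouping cols only; a PivotFirst-style step places
# each phase-1 result into an array slot indexed by its pivot value.
index = {v: i for i, v in enumerate(pivot_values)}
phase2 = defaultdict(lambda: [None] * len(pivot_values))
for (group, pivot), agg in phase1.items():
    phase2[group][index[pivot]] = agg

# Final projection: extract each array entry into its own output column.
result = {g: dict(zip(pivot_values, arr)) for g, arr in phase2.items()}
print(result)  # {'a': {'x': 5, 'y': 2}, 'b': {'x': 3, 'y': None}}
```

The point of the restructuring is visible in phase 1: each input row updates only its own (group, pivot) bucket, instead of being evaluated against one aggregate per distinct pivot value as in the single-aggregation plan.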
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]