[ https://issues.apache.org/jira/browse/SPARK-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267159#comment-15267159 ]

Yin Huai commented on SPARK-13749:
----------------------------------

https://github.com/apache/spark/pull/11583 has been merged. I will resolve this 
JIRA once the follow-up PR is in.

> Faster pivot implementation for many distinct values with two phase 
> aggregation
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-13749
>                 URL: https://issues.apache.org/jira/browse/SPARK-13749
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Andrew Ray
>
> The existing pivot implementation translates into a single aggregation 
> with one aggregate per distinct pivot value. When the number of distinct 
> pivot values is large (say 1000+), this gets extremely slow: each input 
> value is evaluated against every aggregate even though it affects only 
> one of them.
> I'm proposing an alternate strategy for when there are 10+ (a somewhat 
> arbitrary threshold) distinct pivot values. We do two phases of 
> aggregation. In the first, we group by the grouping columns plus the 
> pivot column and perform the specified aggregations (one or sometimes 
> more). In the second, we group by the grouping columns alone and use the 
> new (non-public) PivotFirst aggregate, which rearranges the outputs of 
> the first aggregation into an array indexed by pivot value. Finally, we 
> do a project to extract the array entries into the appropriate output 
> columns.
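
The two-phase strategy described above can be sketched in plain Python, with 
dictionaries standing in for Spark's grouped aggregation. This is only an 
illustration of the algorithm, not Spark's actual implementation; the column 
names, the sum aggregate, and the helper function are all illustrative.

```python
from collections import defaultdict

def two_phase_pivot(rows, group_key, pivot_key, value_key, pivot_values):
    """Illustrative two-phase pivot: aggregate by (group, pivot value),
    then rearrange each group's partial aggregates into an array indexed
    by pivot value (mirroring the PivotFirst step)."""
    # Phase 1: group by (grouping column + pivot column), aggregate (sum here).
    phase1 = defaultdict(int)
    for row in rows:
        phase1[(row[group_key], row[pivot_key])] += row[value_key]

    # Phase 2: group by the grouping column alone, placing each partial
    # aggregate into the slot of its pivot value.
    index = {v: i for i, v in enumerate(pivot_values)}
    result = {}
    for (g, p), agg in phase1.items():
        slots = result.setdefault(g, [None] * len(pivot_values))
        slots[index[p]] = agg
    return result

rows = [
    {"year": 2015, "course": "SQL", "earnings": 100},
    {"year": 2015, "course": "Java", "earnings": 200},
    {"year": 2016, "course": "SQL", "earnings": 50},
    {"year": 2016, "course": "SQL", "earnings": 25},
]
print(two_phase_pivot(rows, "year", "course", "earnings", ["SQL", "Java"]))
# → {2015: [100, 200], 2016: [75, None]}
```

Because phase 1 touches each input row exactly once, the per-row cost no 
longer scales with the number of distinct pivot values, which is the point 
of the proposal.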



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
