Karl-WangSK commented on pull request #29360:
URL: https://github.com/apache/spark/pull/29360#issuecomment-675801089


   > > But shuffle happens during Aggregate here, right? Splitting does not 
change the total amount of shuffled data; it only divides it into several 
parts. Does it really result in a significant improvement?
   > 
   > As @viirya said above, I think the same. Why can this reduce the amount of 
shuffle writes (and improve the performance)? In the case of `expand -> partial 
aggregates`, the aggregates seem to have the same **total** amount of output 
size.
   
   In my view, and according to the benchmark: when the data size is larger 
than the execution memory, the data spills to disk, which hurts performance 
and increases the run time.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
