akpatnam25 opened a new pull request #33644:
URL: https://github.com/apache/spark/pull/33644


   What changes were proposed in this pull request?
   Move final iteration of aggregation of RDD.treeAggregate to an executor with 
one partition and fetch that result to the driver
   Why are the changes needed?
   RDD.fold pulls all shuffle partitions to the driver to merge the result
   a. Driver becomes a single point of failure in the case that there are a lot 
of partitions to do the final aggregation on
   Shuffle machinery at executors is much more robust/fault tolerant compared 
to fetching results to driver.
   Does this PR introduce any user-facing change?
   The previous behavior always did the final aggregation in the driver. The 
user can now (optionally) provide a boolean config (default = false) 
ENABLE_EXECUTOR_TREE_AGGREGATE to do that final aggregation in a single 
partition executor before fetching the results to the driver. The only 
additional cost is that the user will see an extra stage in their job.
   
   How was this patch tested?
   This patch was tested via unit tests, and also tested on a cluster.
   The screenshots showing the extra stage on a cluster are attached below 
(before vs after).
   
![before](https://user-images.githubusercontent.com/24758726/128249830-eefc4bda-f737-4d68-960e-1d1907762538.png)
   
![after](https://user-images.githubusercontent.com/24758726/128249838-be70bc95-9f39-489c-be17-c9c80c4846a4.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to