mridulm edited a comment on pull request #33644:
URL: https://github.com/apache/spark/pull/33644#issuecomment-894320564


   @srowen You are right that the total computation is slightly worse (an 
additional fold with the zero value at the driver and, of course, the extra 
shuffle).
   
   The main issue is what @venkata91 mentioned - on the executor, the work is 
done in the context of shuffle, where memory is better managed (not necessarily 
because the executor has more memory).
   On the driver, it is essentially fetching all partition results (a single 
value per partition) as part of `fold`'s `runJob`: for nontrivial values, this 
causes an OOM at the driver and fails the application (memory is used for IO 
plus N deserializations, in addition to the fold itself).
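   
   To illustrate the contrast being described (this is not the PR's change; the 
names and sizes below are made up), a rough sketch using the existing RDD API: 
`fold` combines one result per partition on the driver after a `runJob`, while 
`treeAggregate` combines partial results on executors first, at the cost of 
extra stages/shuffle.
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Hypothetical setup, for illustration only.
   val spark = SparkSession.builder().appName("fold-vs-treeAggregate").getOrCreate()
   val sc = spark.sparkContext
   
   val size = 1 << 20                     // ~8 MB of Longs per partition result
   val zero = Array.fill(size)(0L)
   val rdd  = sc.parallelize(1 to 1000, 1000).map(_ => Array.fill(size)(1L))
   
   def sum(a: Array[Long], b: Array[Long]): Array[Long] =
     a.zip(b).map { case (x, y) => x + y }
   
   // fold: runJob ships one result per partition back to the driver, which then
   // folds them locally -- the driver has to hold (and deserialize) all N
   // partition results at once, which is where the OOM risk comes from.
   val onDriver = rdd.fold(zero)(sum)
   
   // treeAggregate: partial results are combined on executors first (at the
   // cost of extra tree levels / shuffle), so the driver only receives the
   // final, already-combined value.
   val onExecutors = rdd.treeAggregate(zero)(sum, sum, depth = 2)
   ```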

