mridulm edited a comment on pull request #33644: URL: https://github.com/apache/spark/pull/33644#issuecomment-894320564
@srowen You are right that the total computation is slightly worse (an additional fold with the zero value at the driver and, of course, the extra shuffle). The main issue is what @venkata91 mentioned: at the executor, the aggregation happens in the context of a shuffle, where memory is better managed (not necessarily because the executor has more memory). At the driver, `fold`'s `runJob` essentially fetches the results of all partitions (a single value per partition): for non-trivial values this causes an OOM at the driver and fails the application (memory is needed for IO plus N deserializations, on top of the fold itself).
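To make the distinction concrete, here is a minimal sketch (not the code touched by this PR; the histogram aggregation, sizes, and `depth` value are made up for illustration) contrasting a driver-side `aggregate`/`fold`, which pulls one per-partition result back through `runJob`, with `treeAggregate`, which combines most results on executors before the driver folds the final few:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverFoldVsTreeAggregate {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fold-vs-treeAggregate"))

    val numPartitions = 2000
    val histSize = 1 << 20 // each per-partition result is an ~8 MB Array[Double]

    val data = sc.range(0L, 100000000L, numSlices = numPartitions)

    def seqOp(acc: Array[Double], x: Long): Array[Double] = {
      acc((x % histSize).toInt) += 1.0; acc
    }
    def combOp(a: Array[Double], b: Array[Double]): Array[Double] = {
      var i = 0
      while (i < a.length) { a(i) += b(i); i += 1 }
      a
    }

    // Driver-side path: runJob brings one ~8 MB histogram per partition back
    // to the driver, which has to deserialize and combine ~2000 of them --
    // this is the IO + N-serde memory pressure that can OOM the driver.
    val atDriver = data.aggregate(new Array[Double](histSize))(seqOp, combOp)

    // Shuffle-backed path: treeAggregate adds intermediate levels, so most of
    // the combining runs on executors under shuffle memory management; the
    // driver only folds the handful of results from the last level.
    val viaTree = data.treeAggregate(new Array[Double](histSize))(seqOp, combOp, depth = 3)

    println(s"driver-side total = ${atDriver.sum}, tree total = ${viaTree.sum}")
    sc.stop()
  }
}
```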
