GitHub user JoshRosen commented on the issue:
https://github.com/apache/spark/pull/14854
Another advantage of this PR's approach is straggler resilience. Suppose that
you are going to have to compute all partitions anyway, and that a small
number of partitions will dominate the total computation time. If you take,
say, a 1000-partition job and break it into ten 100-partition jobs, and there
are 10 straggler tasks in total, then splitting the work into a sequence of
jobs increases the likelihood that those stragglers will run in separate
jobs. Because the jobs run one after another, the stragglers' runtimes are
then added together when determining the total runtime of the action. In
contrast, submitting all of those tasks as part of the same job allows the
stragglers to overlap with each other and run in parallel once all of the
smaller tasks have finished, resulting in a better worst-case runtime.
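
To make the argument concrete, here is a rough sketch of a toy model. It is
not Spark's scheduler, and everything in it is an assumption chosen for
illustration: 100 parallel task slots, ordinary tasks that take 1 time unit,
stragglers that take 100 units, exactly one straggler in each block of 100
partitions (the worst case for the split approach), and greedy assignment of
each task to the earliest free slot. The helper names `makespan` and
`sequential_runtime` are hypothetical.

```python
import heapq


def makespan(durations, slots):
    """Finish time of one job when each task, in order, is assigned
    to whichever slot becomes free earliest (greedy list scheduling)."""
    free_at = [0.0] * slots  # time at which each slot is next free
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)
        end = start + d
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


def sequential_runtime(durations, job_size, slots):
    """Total runtime when the tasks are split into consecutive jobs and
    each job must finish before the next one starts."""
    return sum(
        makespan(durations[i:i + job_size], slots)
        for i in range(0, len(durations), job_size)
    )


# Assumed workload: 1000 partitions, one 100-unit straggler per block of 100.
SLOTS = 100
durations = [100.0 if i % 100 == 0 else 1.0 for i in range(1000)]

split = sequential_runtime(durations, job_size=100, slots=SLOTS)
single = makespan(durations, slots=SLOTS)

print("ten 100-partition jobs:", split)
print("one 1000-partition job:", single)
```

In this model every small job waits for its own straggler, so the ten jobs
cost ten straggler runtimes back to back, while the single job lets the
stragglers run side by side and finishes in far less time. Real workloads
will differ depending on how stragglers are distributed and on how many
slots are available.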