Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/14854
  
    Another advantage of this PR's approach is straggler-resilience: imagine 
that you're going to have to compute all partitions anyways and also assume 
that some small number of partitions will dominate the total computation time. 
If you take, say, a 1000 partition job and break it into ten 100 partition jobs 
and assume that there are 10 straggler tasks total, then splitting the total 
work into a sequence of multiple jobs increases the likelihood that those 
stragglers will run in separate jobs, causing their total runtimes to be added 
together when determining the total runtime of the action. In contrast, 
submitting all of those tasks as part of the same job allows them to overlap 
with each other and run in parallel once all of the smaller tasks have 
finished, resulting in a better worst-case runtime.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to