GitHub user rxin commented on the issue:
https://github.com/apache/spark/pull/16677
Actually, looking at the design, this could cause perf regressions in some
cases too, right? It introduces a barrier that did not exist before. If
the number of records to take isn't substantially less than the number of
records on each partition, perf would be much worse. Also, it feels to me that
this isn't a shuffle at all, and we are piggybacking on the wrong
infrastructure. What you really want is a way to buffer blocks temporarily and
then launch a second wave of tasks to rerun some of them.
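
To make the barrier concern concrete, here is a rough toy model in plain Python. It is not Spark code and does not reflect the PR's actual implementation: the function names, the "rows touched" cost measure, and the assumption that a pipelined take reads partitions one after another are all hypothetical, chosen only to illustrate why the extra cost grows as the limit approaches the partition size.

```python
def pipelined_rows_read(partition_sizes, limit):
    """Hypothetical no-barrier model: read partitions in order and
    stop as soon as `limit` rows have been collected."""
    read = 0
    for size in partition_sizes:
        if read >= limit:
            break
        read += min(size, limit - read)
    return read


def barrier_rows_buffered(partition_sizes, limit):
    """Hypothetical barrier model: every partition must finish producing
    up to `limit` rows before any row is handed downstream."""
    return sum(min(size, limit) for size in partition_sizes)


def barrier_overhead(partition_sizes, limit):
    """Extra rows touched because of the barrier, under this toy model."""
    return (barrier_rows_buffered(partition_sizes, limit)
            - pipelined_rows_read(partition_sizes, limit))


if __name__ == "__main__":
    sizes = [1000, 1000, 1000, 1000]  # made-up partition sizes
    for limit in (10, 900):
        print(limit, barrier_overhead(sizes, limit))
```

Under these assumptions the overhead is small when the limit is tiny relative to each partition and large when the limit is close to the partition size, which is the case the comment above is worried about.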