[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

rxin Tue, 18 Sep 2018 17:17:51 -0700

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/16677
  
    Actually, looking at the design, this could cause perf regressions in some 
cases too, right? It introduces a barrier that did not exist before. If the 
number of records to take isn't substantially smaller than the actual number of 
records on each partition, perf would be much worse. Also, it feels to me that 
this isn't a shuffle at all, and we are piggybacking on the wrong 
infrastructure. What you really want is a way to buffer blocks temporarily and 
launch a second wave of tasks to rerun some of them.
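
    To make the regression concern concrete, here is a rough toy model (not 
Spark code, and not taken from the PR). It assumes each partition applies a 
local limit before the barrier, and it measures how much work that local limit 
actually saves. The function name and the partition sizes are hypothetical.

```python
def local_limit_reduction(partition_sizes, limit):
    """Fraction of input records that the per-partition limit drops
    before the barrier (toy model; hypothetical helper, not a Spark API)."""
    total = sum(partition_sizes)
    if total == 0:
        return 0.0
    # Each partition keeps at most `limit` records ahead of the barrier.
    buffered = sum(min(size, limit) for size in partition_sizes)
    return 1.0 - buffered / total


sizes = [1000, 1000, 1000, 1000]

# Limit far below the partition size: almost everything is dropped early,
# so waiting at a barrier costs little.
small = local_limit_reduction(sizes, 10)

# Limit close to the partition size: nearly every record is still buffered,
# so the barrier adds latency without saving much work.
large = local_limit_reduction(sizes, 900)
```

    In this model the reduction is about 99% for a limit of 10 but only about 
10% for a limit of 900, which is the case where the extra barrier is likely to 
hurt.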


