GitHub user rxin commented on the issue:
https://github.com/apache/spark/pull/16677
ok after thinking about it more, i think we should just revert all of these
changes and go back to the drawing board. here's why:
1. the prs change some of the most common/core parts of spark, and they were not
properly designed (as in they haven't gone through actual discussion; there
isn't even a doc on how they work). the prs created much more complicated
implementations for limit / top-k. you might be able to justify the complexity
with the perf improvements, but we'd better write them down, discuss them, and
make sure they are the right design choices. this is just a comment about the
process, not the actual design.
2. now onto the design. i have issues with two major parts:
2a. this pr really wanted an abstraction that buffers data, lets the
driver analyze some statistics about it (records per map task), and then makes
decisions. because spark doesn't yet have that infrastructure, the pr just
adds hacks to the shuffle layer to make it work. there is no proper abstraction
here (see the first sketch after this list).
2b. i'm not even sure the algorithm here is the right one. the pr tries
to parallelize as much as possible by keeping the same number of tasks. imo a
simpler design that would cover the common cases is to buffer the data, get
the record count per map task, and create a new rdd over only the first N
partitions needed to reach the limit (see the second sketch after this list).
that way we don't launch too many tasks, and we retain ordering.
3. the pr's implementation quality is poor. variable names are confusing
(output vs records); documentation is severely lacking; the doc for the
config option is arcane.
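to make 2a concrete, here is a rough, purely hypothetical sketch of what such an abstraction might look like: materialize a stage, hand the driver per-map-task statistics, and let it decide what to run next. none of these types or names exist in spark; they are illustrative only.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch only: none of these types exist in spark today.
// The idea is to separate "materialize and observe" from "decide what
// to run next", instead of wiring decisions into the shuffle code.

// Statistics the driver would see after a stage finishes.
case class StageStatistics(recordsPerMapTask: Array[Long])

trait AdaptiveExecution {
  // Run the stage to completion, buffering its output, and report
  // per-map-task statistics back to the driver.
  def materialize[T](rdd: RDD[T]): (RDD[T], StageStatistics)
}
```

with something like that in place, a limit would be: materialize, inspect the statistics on the driver, then build a cheaper follow-up plan, with no shuffle-layer hacks.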
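and here is a sketch of the simpler algorithm from 2b, built on spark's existing PartitionPruningRDD. the prunedLimit name is made up for illustration, and it assumes the driver already has the per-partition record counts (e.g. from the abstraction sketched above).

```scala
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

// Illustrative sketch: given per-partition record counts already known
// to the driver, schedule tasks only for the first partitions whose
// running total reaches the limit. Partition order (and thus result
// ordering) is preserved; later partitions are never launched.
def prunedLimit[T](rdd: RDD[T],
                   recordsPerPartition: Array[Long],
                   limit: Long): RDD[T] = {
  // Running totals over partitions in their original order.
  val cumulative = recordsPerPartition.scanLeft(0L)(_ + _).tail
  // First partition index at which the running total reaches the limit;
  // if none does, we need every partition.
  val lastNeeded = cumulative.indexWhere(_ >= limit) match {
    case -1 => recordsPerPartition.length - 1
    case i  => i
  }
  // Only the selected partitions ever get tasks.
  PartitionPruningRDD.create(rdd, partitionId => partitionId <= lastNeeded)
}
```

note the last kept partition can still hold more rows than the limit, so a final per-partition take is still needed; the point is that tasks are only launched for the handful of partitions that matter.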
sorry about all of the above, but we gotta do better.