Github user rxin commented on the issue: https://github.com/apache/spark/pull/16677

ok after thinking about it more, i think we should just revert all of these changes and go back to the drawing board. here's why:

1. the prs change some of the most common/core parts of spark, and were not properly designed (as in they haven't gone through actual discussion; there isn't even a doc on how they work). the prs created a much more complicated implementation for limit / top k. you might be able to justify the complexity with the perf improvements, but we'd better write them down, discuss them, and make sure they are the right design choices. this is just a comment about the process, not the actual design.

2. now onto the design, i am having issues with two major parts:

2a. this pr really wanted an abstraction to buffer data, then have the driver analyze some statistics about the data (records per map task) and make decisions. because spark doesn't yet have that infrastructure, this pr just adds some hacks to shuffle to make it work. there is no proper abstraction here.

2b. i'm not even sure the algorithm here is the right one. the pr tries to parallelize as much as possible by keeping the same number of tasks. imo a simpler design that would work for more common cases is to buffer the data, get the records per map task, and create a new rdd with the first N partitions that reach the limit. that way, we don't launch too many tasks, and we retain ordering.

3. the pr's implementation quality is poor: variable names are confusing (output vs records), it's severely lacking documentation, and the doc for the config option is arcane.

sorry about all of the above, but we gotta do better.
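the core of the simpler design in 2b is just a prefix scan over per-map-task record counts. a minimal sketch of that decision logic (hypothetical helper, not actual Spark API; in Spark this would run on the driver over map output statistics):

```python
def partitions_needed(records_per_partition, limit):
    """Given record counts per map task (in partition order), return the
    size of the smallest prefix of partitions whose combined record count
    reaches `limit`. Scanning partitions in order is what preserves the
    original ordering of the output."""
    total = 0
    for i, count in enumerate(records_per_partition):
        total += count
        if total >= limit:
            return i + 1  # the first i+1 partitions are enough
    # the limit exceeds the total number of records: take everything
    return len(records_per_partition)

# e.g. 5 map tasks and LIMIT 100: the first 3 partitions (40+30+50 >= 100)
# suffice, so only 3 downstream tasks are launched instead of 5
print(partitions_needed([40, 30, 50, 20, 10], 100))  # → 3
```

the new rdd would then be built from only those first N partitions, so the number of tasks launched scales with the limit rather than with the original parallelism.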