GitHub user rxin commented on the issue:
https://github.com/apache/spark/pull/16677
ok after thinking about it more, i think we should just revert all of these
changes and go back to the drawing board. here's why:
1. the prs change some of the most common/core parts of spark, and they were not
properly designed (as in they haven't gone through actual discussion; there
isn't even a doc on how they work). the prs created much more complicated
implementations for limit / top-k. you might be able to justify the complexity
with the perf improvements, but we'd better write them down, discuss them, and
make sure they are the right design choices. this is just a comment about the
process, not the actual design.
2. now onto the design. i have issues with two major parts:
2a. this pr really wanted an abstraction that buffers data, lets the
driver analyze some statistics about it (records per map task), and then makes
decisions. because spark doesn't yet have that infrastructure, the pr just
adds hacks to the shuffle layer to make it work. there is no proper abstraction
here (see the first sketch after this list).
2b. i'm not even sure the algorithm here is the right one. the pr tries
to parallelize as much as possible by keeping the same number of tasks. imo a
simpler design that would cover the common cases is to buffer the data, get
the record count per map task, and create a new rdd over only the first N
partitions needed to reach the limit (see the second sketch after this list).
that way we don't launch too many tasks, and we retain ordering.
3. the pr's implementation quality is poor. variable names are confusing
(output vs records); documentation is severely lacking; the doc for the
config option is arcane.
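to make 2a concrete, here is a rough, purely hypothetical sketch of what such an abstraction might look like: materialize a stage, hand the driver per-map-task statistics, and let it decide what to run next. none of these types or names exist in spark; they are illustrative only.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch only: none of these types exist in spark today.
// The idea is to separate "materialize and observe" from "decide what
// to run next", instead of wiring decisions into the shuffle code.

// Statistics the driver would see after a stage finishes.
case class StageStatistics(recordsPerMapTask: Array[Long])

trait AdaptiveExecution {
  // Run the stage to completion, buffering its output, and report
  // per-map-task statistics back to the driver.
  def materialize[T](rdd: RDD[T]): (RDD[T], StageStatistics)
}
```

with something like that in place, a limit would be: materialize, inspect the statistics on the driver, then build a cheaper follow-up plan, with no shuffle-layer hacks.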
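and here is a sketch of the simpler algorithm from 2b, built on spark's existing PartitionPruningRDD. the prunedLimit name is made up for illustration, and it assumes the driver already has the per-partition record counts (e.g. from the abstraction sketched above).

```scala
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

// Illustrative sketch: given per-partition record counts already known
// to the driver, schedule tasks only for the first partitions whose
// running total reaches the limit. Partition order (and thus result
// ordering) is preserved; later partitions are never launched.
def prunedLimit[T](rdd: RDD[T],
                   recordsPerPartition: Array[Long],
                   limit: Long): RDD[T] = {
  // Running totals over partitions in their original order.
  val cumulative = recordsPerPartition.scanLeft(0L)(_ + _).tail
  // First partition index at which the running total reaches the limit;
  // if none does, we need every partition.
  val lastNeeded = cumulative.indexWhere(_ >= limit) match {
    case -1 => recordsPerPartition.length - 1
    case i  => i
  }
  // Only the selected partitions ever get tasks.
  PartitionPruningRDD.create(rdd, partitionId => partitionId <= lastNeeded)
}
```

note the last kept partition can still hold more rows than the limit, so a final per-partition take is still needed; the point is that tasks are only launched for the handful of partitions that matter.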
sorry about all of the above, but we gotta do better.