izchen opened a new pull request #29028: URL: https://github.com/apache/spark/pull/29028
…esults in executor or driver ### What changes were proposed in this pull request? Add an optional config to determine whether the intermediate results(local TopK) of each partition in RDD.takeOrdered will be merged in driver process or executor process. If set to true, merge in driver process(by util.PriorityQueue), which will get shorter waiting time for return. But if the intermediate results are too large and too many partitions, the intermediate results may accumulate in the memory of the driver process, causing excessive memory pressure. If set to false, merge in executor process(by guava.QuickSelect), intermediate results will not accumulate in memory, but will cause longer runtimes. ### Why are the changes needed? The problem with original implementation is that if the intermediate results are too large and too many partitions, the intermediate results may accumulate in the memory of the driver process, causing excessive memory pressure. ### Does this PR introduce _any_ user-facing change? Adds configuration parameter "spark.rdd.takeOrdered.mergeInDriver" (default: true) ### How was this patch tested? Added UTs. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
