izchen opened a new pull request #29028:
URL: https://github.com/apache/spark/pull/29028


   …esults in executor or driver
   
   ### What changes were proposed in this pull request?
   Add an optional config that determines whether the intermediate results (the 
local top-K of each partition) in RDD.takeOrdered are merged in the driver 
process or in an executor process. If set to true, they are merged in the driver 
process (using util.PriorityQueue), which returns the result sooner. However, if 
the intermediate results are large and there are many partitions, they can 
accumulate in the memory of the driver process and cause excessive memory 
pressure. If set to false, they are merged in an executor process (using 
guava.QuickSelect), so the intermediate results do not accumulate in driver 
memory, at the cost of a longer runtime.
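
The trade-off described above can be sketched without Spark by modelling each partition as a plain Scala collection. This is only an illustration under stated assumptions: the names `TakeOrderedSketch`, `localTopK`, `mergeInDriver`, `mergeInExecutor`, and `Merged` are hypothetical and do not come from the patch, and the sketch uses `scala.collection.mutable.PriorityQueue` for both paths rather than the util.PriorityQueue and guava.QuickSelect implementations named in this PR.

```scala
import scala.collection.mutable

// Hypothetical stand-alone model of the two merge strategies; not the patch's code.
object TakeOrderedSketch {

  // Result of a merge plus the number of elements that had to reach the driver.
  final case class Merged[T](result: Seq[T], elementsSentToDriver: Int)

  // Local top-K of one partition: the k smallest elements under `ord`, sorted.
  def localTopK[T](partition: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
    // Max-heap under `ord`, so `head` is the largest element retained so far.
    val heap = mutable.PriorityQueue.empty[T](ord)
    partition.foreach { x =>
      if (heap.size < k) heap.enqueue(x)
      else if (k > 0 && ord.lt(x, heap.head)) {
        heap.dequeue() // evict the current largest
        heap.enqueue(x)
      }
    }
    heap.toSeq.sorted(ord)
  }

  // Driver-side merge: every partition's local top-K is shipped to the driver,
  // which may therefore hold up to (number of partitions * k) elements at once.
  def mergeInDriver[T](partials: Seq[Seq[T]], k: Int)(implicit ord: Ordering[T]): Merged[T] = {
    val received = partials.map(_.size).sum
    Merged(localTopK(partials.iterator.flatMap(_.iterator), k), received)
  }

  // Executor-side merge (assumed shape): one extra task combines the partial
  // results, and only the final k elements are sent to the driver.
  def mergeInExecutor[T](partials: Seq[Seq[T]], k: Int)(implicit ord: Ordering[T]): Merged[T] = {
    val merged = localTopK(partials.iterator.flatMap(_.iterator), k)
    Merged(merged, merged.size)
  }
}
```

Both functions return the same elements; the difference the sketch makes visible is `elementsSentToDriver`, which grows with the number of partitions for the driver-side merge and stays at most `k` for the executor-side merge.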
   
   ### Why are the changes needed?
   The problem with the original implementation is that when the intermediate 
results are large and there are many partitions, they can accumulate in the 
memory of the driver process, causing excessive memory pressure.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Adds the configuration parameter
   "spark.rdd.takeOrdered.mergeInDriver" (default: true)
   
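A minimal sketch of how the proposed parameter might be set from application code. It assumes a Spark build that includes this patch; on other builds the key would presumably be ignored. The object name, app name, master URL, and data sizes are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: requires a Spark build that contains this PR's changes.
object TakeOrderedConfigExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("takeOrdered-merge-example")
      .setMaster("local[4]")
      // Key proposed in this PR; "false" asks for the merge to run in an executor.
      .set("spark.rdd.takeOrdered.mergeInDriver", "false")

    val sc = new SparkContext(conf)
    try {
      // Many partitions make the size of the intermediate results matter.
      val smallest = sc.parallelize(1 to 1000000, numSlices = 200).takeOrdered(10)
      println(smallest.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The same key could also be passed on the command line with `spark-submit --conf spark.rdd.takeOrdered.mergeInDriver=false`.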
   ### How was this patch tested?
   Added unit tests.
   




