zhengchenyu commented on PR #44512: URL: https://github.com/apache/spark/pull/44512#issuecomment-1871886874
@Ngone51 Thanks for your review! The combine is the key problem. We still combine, but combining inside ExternalSorter never triggers an extra spill. Combining happens in insertAllAndUpdateMetrics and in merge (sketched below):

1. Combine on insert: look up the existing key and merge the new value into it, entirely in memory.
2. Combine on merge: sort the combined iterator and combine records until the next key occurs, also in memory.

These two steps are not additional operations; ExternalSorter already performs them.

Before the change, we used ExternalAppendOnlyMap to combine, which spills to disk when memory exceeds the threshold, and then ExternalSorter to sort, which also spills to disk when memory exceeds the threshold. **That means we may spill twice (in ExternalAppendOnlyMap and in ExternalSorter) when the shuffle data is huge.**

Since we can combine during the sorting process, after the change we use only ExternalSorter, with the combine happening during its insert and merge steps, so there is no need for ExternalAppendOnlyMap at all. **That means we spill at most once (in ExternalSorter).**

Spilling is time-consuming, so for a sort-and-combine shuffle with huge shuffle data, this PR saves time.
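Here is a minimal standalone sketch of the two in-memory combine paths described above. It is an illustration only, not Spark's actual ExternalSorter code (which, among other differences, keys its in-memory map by (partition, key)); the names `combineOnInsert` and `mergeCombine` are hypothetical:

```scala
object CombineSketch {
  // (1) Combine on insert: look up the existing key and merge the new value
  // in memory -- no extra pass over the data, so no extra spill.
  def combineOnInsert[K, V, C](
      records: Iterator[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): collection.mutable.Map[K, C] = {
    val buffer = collection.mutable.Map.empty[K, C]
    records.foreach { case (k, v) =>
      buffer(k) = buffer.get(k) match {
        case Some(c) => mergeValue(c, v)
        case None    => createCombiner(v)
      }
    }
    buffer
  }

  // (2) Combine on merge: given an iterator already sorted by key, combine
  // consecutive records until the next key appears, again purely in memory.
  def mergeCombine[K, C](
      sorted: Iterator[(K, C)],
      mergeCombiners: (C, C) => C): Iterator[(K, C)] = new Iterator[(K, C)] {
    private val buffered = sorted.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): (K, C) = {
      var (key, combined) = buffered.next()
      while (buffered.hasNext && buffered.head._1 == key) {
        combined = mergeCombiners(combined, buffered.next()._2)
      }
      (key, combined)
    }
  }

  def main(args: Array[String]): Unit = {
    val records = Iterator(("a", 1), ("b", 2), ("a", 3))
    val partial = combineOnInsert[String, Int, Int](records, identity, _ + _)
    val sortedRuns = partial.toSeq.sortBy(_._1).iterator
    println(mergeCombine[String, Int](sortedRuns, _ + _).toList) // List((a,4), (b,2))
  }
}
```

Because path (2) only compares adjacent keys of an already-sorted iterator, it needs no extra data structure or buffering, which is why folding the combine into ExternalSorter introduces no additional spill.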
