zhengchenyu commented on PR #1660: URL: https://github.com/apache/incubator-uniffle/pull/1660#issuecomment-2071292801
@advancedxy @zuston

(1) Remote merge doesn't just work on sort, it also works on combine.

You're paying too much attention to sort. In fact, merge may involve sort or combine. For Spark, I think combine is the more general case. Remote merge also solves the problem of spilling data when combining. For an RDD that requires combine but does not require sort, the shuffle server sorts by hash(key). This keeps records with the same key together as much as possible, so on the reduce side we can combine in memory.

(2) About sort

I haven't investigated SortExec yet; I will investigate it next. But I guess the reason for introducing SortExec is to save a stage. If that is the case, then we can add a stage. BTW, if the key of the new shuffle is the same as the previous one, we can get better performance [SPARK-46512].

BTW, although Spark is a better computing framework, it is still based on the MapReduce mechanism, which makes it no different from Hadoop MapReduce and Tez.
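To illustrate the hash(key) ordering described in point (1), here is a minimal Python sketch. It is not the code in this PR: `server_merge`, `reduce_combine`, and `record_hash` are hypothetical names, and CRC32 stands in for whatever hash the shuffle server would actually use. The idea is that once the server has ordered records by hash(key), the reducer only needs to hold the keys of one hash value in memory at a time.

```python
import heapq
import zlib
from itertools import groupby
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]


def record_hash(record: Record) -> int:
    # Stable hash of the key (assumption: CRC32). Python's built-in hash()
    # for str is salted per process, so it is not used here.
    return zlib.crc32(record[0].encode("utf-8"))


def server_merge(blocks: Iterable[List[Record]]) -> Iterator[Record]:
    """Hypothetical server-side step: merge map outputs ordered by hash(key)."""
    # Sort each block by hash(key), then k-way merge the sorted blocks.
    sorted_blocks = [sorted(block, key=record_hash) for block in blocks]
    return heapq.merge(*sorted_blocks, key=record_hash)


def reduce_combine(
    merged: Iterable[Record], combine: Callable[[int, int], int]
) -> Iterator[Record]:
    """Hypothetical reduce-side step: combine one hash run at a time."""
    for _, run in groupby(merged, key=record_hash):
        # Different keys can share a hash, so keep a small map per run.
        acc: Dict[str, int] = {}
        for key, value in run:
            acc[key] = combine(acc[key], value) if key in acc else value
        yield from acc.items()


blocks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
result = list(reduce_combine(server_merge(blocks), lambda x, y: x + y))
print(sorted(result))  # [('a', 4), ('b', 7), ('c', 4)]
```

Because equal keys always hash to the same value, every record for a key falls inside one run, so each key is emitted exactly once without the reducer buffering the whole partition.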
