Re: [PR] [#1239] Remote merge on the shuffle server side. [incubator-uniffle]

via GitHub Mon, 22 Apr 2024 19:26:30 -0700


zhengchenyu commented on PR #1660:
URL: 
https://github.com/apache/incubator-uniffle/pull/1660#issuecomment-2071292801


   @advancedxy @zuston 
   (1) Remote merge works not only for sort but also for combine
   You're paying too much attention to sort. In fact, merge may involve sort or 
combine, and for Spark I think combine is the more general case. Remote merge 
also solves the problem of spilling data during combine. For an RDD that 
requires combine but not sort, the shuffle server sorts records by hash(key). 
This keeps records with the same key together as much as possible, so on the 
reduce side we can combine in memory.
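
   To illustrate point (1), here is a minimal sketch of the idea. It is not 
Uniffle code: the function names (`stable_hash`, `server_merge`, 
`reduce_combine`) and the use of CRC32 as the hash are assumptions made only 
for this example. The server orders records by hash(key) alone, and the reducer 
streams over them while holding only one hash bucket in memory at a time.

```python
import zlib
from itertools import groupby


def stable_hash(key: str) -> int:
    # Hypothetical stand-in for whatever hash the shuffle server would use.
    return zlib.crc32(key.encode("utf-8"))


def server_merge(records):
    """Server side (sketch): order records by hash(key) only.

    No key comparator is needed, so this also works for keys that
    have no natural ordering.
    """
    return sorted(records, key=lambda kv: stable_hash(kv[0]))


def reduce_combine(sorted_records, combine):
    """Reduce side (sketch): combine while streaming over hash-ordered input.

    Records with equal hashes are adjacent, so only the keys of the
    current hash bucket are held in memory. The small dict handles
    distinct keys that happen to collide on the same hash.
    """
    for _, bucket in groupby(sorted_records, key=lambda kv: stable_hash(kv[0])):
        acc = {}
        for key, value in bucket:
            acc[key] = combine(acc[key], value) if key in acc else value
        yield from acc.items()


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
combined = list(reduce_combine(server_merge(records), lambda x, y: x + y))
```

   Each key is emitted exactly once with its combined value, and the reducer 
never needs a map over the whole partition.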
   (2) About sort
   I haven't investigated SortExec yet; I will look into it next. My guess is 
that SortExec was introduced to save a stage. If that is the case, we can add a 
stage. BTW, if the key of the new shuffle is the same as that of the previous 
one, we can get better performance [SPARK-46512].
   BTW, although Spark is a better computing framework, it is still based on 
the MapReduce mechanism, so in this respect it is no different from Hadoop 
MapReduce and Tez.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

