zhangyue19921010 commented on PR #13265: URL: https://github.com/apache/hudi/pull/13265#issuecomment-2869416537
> If partition IDs are not allocated centrally, it is indeed impossible to fully handle partition-key conflicts. However, this algorithm has a drawback: it may produce an excessive number of reduce tasks. When records are large and there is no remote shuffle service, the shuffle read is likely to fail. If this is enabled by default, it may be necessary to document the problems it can cause.

Hi, thanks for your response. Indeed, without a centralized planner it is challenging to fully address skewness, especially in large-scale data scenarios. There is one point I am not entirely clear about: when comparing RemotePartitioner and LocalPartitioner, how do they differ in terms of shuffle data volume? My understanding is that in a bucket index scenario the amount of shuffled data should be the same regardless of which partitioner is used, since both rely on a global shuffle keyed by the bucket ID.
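To make the point about shuffle volume concrete, here is a minimal sketch (the class and method names are hypothetical illustrations, not Hudi's actual API): under either scheme every record is routed by its bucket ID and crosses the network exactly once, so the total shuffled bytes are identical; the two partitioners differ only in how bucket IDs are assigned to reduce tasks.

```java
import java.util.function.IntUnaryOperator;

public class BucketShuffleSketch {
    // Hypothetical "local"-style partitioner: bucket ID maps directly onto
    // the write parallelism via modulo. Hot buckets can pile onto one task.
    static IntUnaryOperator localPartitioner(int parallelism) {
        return bucketId -> bucketId % parallelism;
    }

    // Hypothetical "remote"-style partitioner: the same records are shuffled,
    // only the bucket -> reduce-task assignment is scrambled differently.
    static IntUnaryOperator remotePartitioner(int parallelism, int seed) {
        return bucketId -> Math.floorMod(bucketId * 31 + seed, parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        IntUnaryOperator local = localPartitioner(parallelism);
        IntUnaryOperator remote = remotePartitioner(parallelism, 17);
        // Each bucket (and thus each record in it) is shuffled exactly once
        // under both schemes; only the destination task differs.
        for (int bucketId = 0; bucketId < 8; bucketId++) {
            System.out.printf("bucket %d -> local task %d, remote task %d%n",
                    bucketId, local.applyAsInt(bucketId), remote.applyAsInt(bucketId));
        }
    }
}
```

The sketch assumes a fixed bucket count per table partition, as in the bucket index; it is only meant to show that the choice of partitioner changes task assignment (and therefore skew and task count), not the volume of data read during the shuffle.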
