zhangyue19921010 commented on PR #13265: URL: https://github.com/apache/hudi/pull/13265#issuecomment-2869416537
> If partition IDs are not allocated centrally, it is indeed impossible to fully handle partition-key conflicts. However, this algorithm has a drawback: it may produce an excessive number of reduce tasks. When records are large and there is no remote shuffle service, the shuffle read is likely to fail. If this is enabled by default, it may be necessary to document the problems it can cause.

Hi, thanks for your response. Indeed, without a centralized planner it is challenging to fully address skewness, especially in large-scale data scenarios. There is one point I am not entirely clear about: when comparing RemotePartitioner and LocalPartitioner, how do they differ in terms of shuffle data volume? My understanding is that in a bucket index scenario the amount of shuffled data should be the same regardless of which partitioner is used, since both rely on a global shuffle keyed by the bucket ID.
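To make the point about shuffle volume concrete, here is a minimal sketch (the class and method names are hypothetical illustrations, not Hudi's actual API): under either scheme every record is routed by its bucket ID and crosses the network exactly once, so the total shuffled bytes are identical; the two partitioners differ only in how bucket IDs are assigned to reduce tasks.

```java
import java.util.function.IntUnaryOperator;

public class BucketShuffleSketch {
    // Hypothetical "local"-style partitioner: bucket ID maps directly onto
    // the write parallelism via modulo. Hot buckets can pile onto one task.
    static IntUnaryOperator localPartitioner(int parallelism) {
        return bucketId -> bucketId % parallelism;
    }

    // Hypothetical "remote"-style partitioner: the same records are shuffled,
    // only the bucket -> reduce-task assignment is scrambled differently.
    static IntUnaryOperator remotePartitioner(int parallelism, int seed) {
        return bucketId -> Math.floorMod(bucketId * 31 + seed, parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        IntUnaryOperator local = localPartitioner(parallelism);
        IntUnaryOperator remote = remotePartitioner(parallelism, 17);
        // Each bucket (and thus each record in it) is shuffled exactly once
        // under both schemes; only the destination task differs.
        for (int bucketId = 0; bucketId < 8; bucketId++) {
            System.out.printf("bucket %d -> local task %d, remote task %d%n",
                    bucketId, local.applyAsInt(bucketId), remote.applyAsInt(bucketId));
        }
    }
}
```

The sketch assumes a fixed bucket count per table partition, as in the bucket index; it is only meant to show that the choice of partitioner changes task assignment (and therefore skew and task count), not the volume of data read during the shuffle.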
