Re: [I] [Feature Discuss] Partition-Level Bucket Index [hudi]

via GitHub Fri, 07 Feb 2025 03:34:59 -0800


zhangyue19921010 commented on issue #12699:
URL: https://github.com/apache/hudi/issues/12699#issuecomment-2642684431


   > [@zhangyue19921010](https://github.com/zhangyue19921010) We have 
implemented dynamic partition bucketing, which supports regular expressions, 
similar to your idea. The only difference is that we store bucket information 
in the ./hoodie/.bucket directory. Since the bucket information is minimal, 
it's efficient to store it in a single file. This approach simplifies the 
process of retrieving partition-level bucket counts and performing bucket 
pruning. At the same time, with the help of Hudi's timeline, we can easily 
ensure the consistency of bucket information.
   
   Hi @xiarixiaoyao, thanks for your reply! It seems that a dynamic 
partition-level bucket index is indeed a common requirement.
   Storing the bucket information in the `./hoodie/.bucket` directory is a good 
idea. But how do you handle two jobs writing concurrently? In that case, 
multiple tasks may operate on the same partition meta file at once, even if 
they write to different partitions.
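
To make the concurrency concern concrete, here is a minimal sketch of what can happen when two jobs update a single bucket metadata file, and one possible way to detect the conflict with a version check. Everything in it is an assumption for illustration: `BucketMetaFile`, `bucket_count`, the regex-to-count rule format, and the default of 4 buckets are hypothetical names and values, not Hudi's API or the on-disk layout of `./hoodie/.bucket`.

```python
import re


class BucketMetaFile:
    """Hypothetical in-memory stand-in for a single bucket metadata file."""

    def __init__(self):
        self.version = 0
        self.rules = {}  # partition regex -> bucket count

    def read(self):
        # Return a snapshot so each job edits its own private copy.
        return self.version, dict(self.rules)

    def commit(self, expected_version, new_rules):
        # Compare-and-swap: reject the write if another job committed first.
        if expected_version != self.version:
            return False
        self.rules = dict(new_rules)
        self.version += 1
        return True


def bucket_count(rules, partition, default=4):
    """Return the bucket count of the first rule matching the whole partition path."""
    for pattern, count in rules.items():
        if re.fullmatch(pattern, partition):
            return count
    return default


meta = BucketMetaFile()

# Two jobs read the same snapshot, then each adds a rule for a different partition.
version_a, rules_a = meta.read()
version_b, rules_b = meta.read()
rules_a[r"dt=2025-01-.*"] = 16
rules_b[r"dt=2025-02-.*"] = 32

first_ok = meta.commit(version_a, rules_a)   # job A commits first
second_ok = meta.commit(version_b, rules_b)  # job B holds a stale snapshot

# Job B has to re-read and re-apply its change before committing again.
if not second_ok:
    version_b, rules_b = meta.read()
    rules_b[r"dt=2025-02-.*"] = 32
    retry_ok = meta.commit(version_b, rules_b)
```

Without the version check, job B's write would silently replace job A's rule, even though the two jobs touch different partitions. That is the situation the question above is asking about.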


