[I] [Feature Discuss] Partition-Level Bucket Index [hudi]

via GitHub Fri, 24 Jan 2025 01:20:36 -0800


zhangyue19921010 opened a new issue, #12699:
URL: https://github.com/apache/hudi/issues/12699


   Hi Hudis:
   
   As we know, Hudi proposed and introduced the Bucket Index in RFC-29. The 
Bucket Index unifies indexing across Flink and Spark; that is, Spark and Flink 
can upsert the same Hudi table using the bucket index.
   
   However, the Bucket Index is limited to a fixed number of buckets. To 
solve this problem, RFC-42 proposed consistent hashing, which resizes buckets 
by dynamically splitting or merging several local buckets.
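
   To illustrate why a fixed bucket count is limiting, here is a minimal 
sketch of hash-modulo bucket assignment. It is not Hudi's implementation: the 
function name `bucket_id` and the use of CRC32 are assumptions chosen only to 
keep the example deterministic and self-contained.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a simple hash-modulo scheme.

    CRC32 is a stand-in hash for illustration, not the hash Hudi uses.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


# Hypothetical sample keys, used only to show the effect of rescaling.
keys = [f"key-{i}" for i in range(1000)]

# Count keys whose bucket changes when the bucket count goes from 4 to 8.
moved = sum(1 for k in keys if bucket_id(k, 4) != bucket_id(k, 8))
print(f"{moved} of {len(keys)} keys change bucket when rescaling 4 -> 8")
```

   Because the bucket depends on the bucket count, changing that count 
reassigns many existing keys, so the data written under the old count has to 
be rewritten before the new count can be used.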
   
   But from our production (PRD) experience, sometimes we only need a 
Partition-Level Bucket Index with offline bucket rescaling, without 
introducing additional effort (multiple writes, clustering, automatic 
resizing, etc.). The more complex the architecture, the more error-prone 
it is and the greater the operation and maintenance burden.
   
   In this regard, I want to **upgrade the traditional Bucket Index to 
implement a Partition-Level Bucket Index,** so that users can set a specific 
number of buckets for different partitions through a rule engine (such as 
regular expression matching). In addition, for existing partitions, an 
offline command would be provided to reorganize the data using insert 
overwrite (writes to the affected partition need to be stopped first).
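
   As a rough sketch of what such a rule engine could look like, the snippet 
below resolves a bucket count from a partition path by regular expression 
matching. Everything here is hypothetical: `BucketRule`, 
`resolve_num_buckets`, the `dt=...` partition layout, the bucket counts, and 
the first-match-wins ordering are assumptions for illustration, not an 
existing Hudi API or config.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BucketRule:
    """Hypothetical rule: partitions matching `pattern` get `num_buckets`."""
    pattern: str      # regex matched against the whole partition path
    num_buckets: int


def resolve_num_buckets(partition_path: str,
                        rules: Sequence[BucketRule],
                        default_num_buckets: int) -> int:
    """Return the bucket count of the first matching rule, else the default.

    First-match-wins is an assumed policy; the proposal does not specify one.
    """
    for rule in rules:
        if re.fullmatch(rule.pattern, partition_path):
            return rule.num_buckets
    return default_num_buckets


# Example rules (assumed values): recent partitions get more buckets.
RULES = [
    BucketRule(r"dt=2025-01-\d{2}", 64),
    BucketRule(r"dt=2024-.*", 16),
]

for path in ("dt=2025-01-24", "dt=2024-06-01", "dt=2023-12-31"):
    print(path, "->", resolve_num_buckets(path, RULES, default_num_buckets=8))
```

   Partitions that match no rule fall back to the table-level default, so 
existing tables would keep their current bucket count unless a rule says 
otherwise.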
   
   Any thoughts on this change? Any feedback would be greatly appreciated!


