wqwl611 opened a new pull request, #6858:
URL: https://github.com/apache/hudi/pull/6858

   ### Change Logs
   rangeBucket is mainly intended for syncing MySQL tables to Hudi in near 
real time. It avoids the main drawback of simpleBucket, whose number of buckets 
is fixed.
   
   A MySQL table usually has an auto-increment id primary key field. In the 
MySQL CDC synchronization scenario, we can use the database name and table name 
as the partition fields of the Hudi table and id as its primary key field. This 
also handles sharded databases and sharded tables. 
   
   To get better sync performance, we usually use the bucket index. With the 
simple bucket index, however, the number of buckets is fixed, so it is difficult 
to choose a suitable number up front, and as the table grows, the original 
number of buckets is no longer appropriate. 
   
   So I propose rangeBucket. In the simpleBucket index, the bucket number is 
(hash % bucketNum). In rangeBucket, the bucket number is (id / fixedStep), using 
integer division, so the number of buckets increases as the id grows. For 
example, with step = 10 and an auto-increment id, a new bucket is created for 
every 10 records.
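
   The two bucket-number formulas above can be sketched as follows. This is a 
minimal illustration, not the code in this pull request: the class and method 
names are hypothetical, ids are assumed to be non-negative, and 
`String.hashCode()` merely stands in for whatever hash the real index uses.

```java
public final class BucketNumberSketch {
    private BucketNumberSketch() {}

    // simpleBucket: hash of the record key modulo a FIXED bucket count.
    // floorMod keeps the result in [0, bucketNum) even for negative hashes.
    public static int simpleBucket(String recordKey, int bucketNum) {
        if (bucketNum <= 0) {
            throw new IllegalArgumentException("bucketNum must be positive");
        }
        return Math.floorMod(recordKey.hashCode(), bucketNum);
    }

    // rangeBucket: integer division of the auto-increment id by a fixed step,
    // so new buckets appear as the id grows.
    public static long rangeBucket(long id, long step) {
        if (id < 0 || step <= 0) {
            throw new IllegalArgumentException("id must be >= 0 and step > 0");
        }
        return id / step;
    }

    public static void main(String[] args) {
        long step = 10;
        for (long id : new long[] {0, 9, 10, 25, 100}) {
            System.out.println("id=" + id
                + " -> range bucket " + rangeBucket(id, step)
                + ", simple bucket " + simpleBucket(Long.toString(id), 4));
        }
    }
}
```

   With step = 10, ids 0 to 9 fall into bucket 0, ids 10 to 19 into bucket 1, 
and so on, while the simple bucket number never leaves the range 0 to 3.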
   
   In my actual workload, I set step = 1,000,000. Since the records of a MySQL 
table are usually similar in size, each bucket ends up at roughly 50 MB to 
350 MB, which avoids the drawback of the fixed number of buckets in simpleBucket.
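
   As a rough illustration of how the bucket count and bucket size follow from 
the step, here is a small sketch. The class and method names are hypothetical, 
ids are assumed to be non-negative, and the 200 bytes per stored record is a 
made-up value for the example, not a measurement from this pull request.

```java
public final class RangeBucketGrowthSketch {
    private RangeBucketGrowthSketch() {}

    // Number of buckets touched by ids 0..maxId when bucket = id / step.
    public static long bucketCount(long maxId, long step) {
        if (maxId < 0 || step <= 0) {
            throw new IllegalArgumentException("maxId must be >= 0 and step > 0");
        }
        return maxId / step + 1;
    }

    // Rough size of one full bucket: records per bucket times the average
    // stored size of a record (an assumed input, not something Hudi reports).
    public static long estimatedBucketBytes(long step, long avgStoredBytesPerRecord) {
        return Math.multiplyExact(step, avgStoredBytesPerRecord);
    }

    public static void main(String[] args) {
        long step = 1_000_000L;
        System.out.println("buckets for maxId=25,000,000: "
            + bucketCount(25_000_000L, step));
        System.out.println("estimated bytes per bucket at 200 B/record: "
            + estimatedBucketBytes(step, 200L));
    }
}
```

   The bucket count grows linearly with the largest id, while the size of a 
full bucket stays bounded by the step and the per-record size.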
   
   
   
   ### Impact
   
   Introduces a new bucket index engine, RANGE_BUCKET. It can be used as follows:
   
         option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
         option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
         option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
         option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
   
   **Risk level: none | low | medium | high**
   
   low
   
   

