wqwl611 opened a new pull request, #6636:
URL: https://github.com/apache/hudi/pull/6636

   ### Change Logs
   Tables in MySQL and similar databases usually have an auto-increment primary key. Based on this fact and the Bucket index, we propose a Range_Bucket index. It has performed well in my practice.
   
   The basic concept is similar to the Bucket index; the most important part is: bucketId = primaryKey / stepSize
   For example, with stepSize = 2, the bucketId mapping looks like this:
   pKey, bucketId
   1, 0
   2, 0
   3, 1
   4, 1
   5, 2
   6, 2
   7, 3
   ...
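
As a minimal sketch of this mapping (not the PR's actual implementation), the Scala below computes bucket IDs in two ways. The names `RangeBucketSketch`, `bucketIdAsWritten`, and `bucketIdOneBased` are hypothetical. With integer division, the formula as written maps key 2 to bucket 1, whereas the listed mapping (keys 1 and 2 both in bucket 0) matches `(pKey - 1) / stepSize`; which variant the PR uses is not confirmed here, so both are shown.

```scala
object RangeBucketSketch {
  // Formula as stated in the description, assuming integer (truncating) division.
  def bucketIdAsWritten(pKey: Long, stepSize: Long): Long = {
    require(stepSize > 0, "stepSize must be positive")
    pKey / stepSize
  }

  // Assumed 1-based variant; it reproduces the pKey -> bucketId pairs listed above.
  def bucketIdOneBased(pKey: Long, stepSize: Long): Long = {
    require(pKey >= 1 && stepSize > 0, "pKey must be >= 1 and stepSize positive")
    (pKey - 1) / stepSize
  }

  def main(args: Array[String]): Unit = {
    // Prints one "pKey, bucketId" pair per line for keys 1 through 7.
    (1L to 7L).foreach { k =>
      println(s"$k, ${bucketIdOneBased(k, 2L)}")
    }
  }
}
```

Either way, consecutive keys share a bucket, so rows inserted around the same time tend to land in the same file group.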
   
   In my practice, I sync more than 1000 MySQL tables to Hudi with stepSize = 1,500,000, and the base file size is about 50 MB to 350 MB.
   
   As a test, setting stepSize = 2 and writing pKey = {1, 4, 9} produces three base files:
   <img width="762" alt="image" src="https://user-images.githubusercontent.com/67826098/189126613-a50e3347-900f-4477-a4e8-f935bd996ebd.png">
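
The three base files are consistent with the three keys falling into three different buckets, assuming one base file per non-empty bucket. The sketch below checks that count; `BucketCountSketch` and its methods are hypothetical names, and integer division is assumed for the stated formula.

```scala
object BucketCountSketch {
  // Integer division, following bucketId = primaryKey / stepSize from the description.
  def bucketId(pKey: Long, stepSize: Long): Long = pKey / stepSize

  // Distinct buckets touched by a set of keys.
  // Assumption: each non-empty bucket yields one base file.
  def distinctBuckets(keys: Seq[Long], stepSize: Long): Set[Long] =
    keys.map(k => bucketId(k, stepSize)).toSet

  def main(args: Array[String]): Unit = {
    val buckets = distinctBuckets(Seq(1L, 4L, 9L), stepSize = 2L)
    println(s"buckets = ${buckets.toSeq.sorted.mkString(", ")}, count = ${buckets.size}")
  }
}
```

With these inputs the keys map to buckets 0, 2, and 4, so the count is 3.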
   
   
   
   
   
   
   ### Impact
   
   Introduces a new index, RANGE_BUCKET, which can be used as follows:
   
         .option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name())
         .option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET")
         .option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2)
         .option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET")
         .option(HoodieLayoutConfig.LAYOUT_PARTITIONER_CLASS_NAME.key, classOf[SparkRangeBucketIndexPartitioner[_]].getName)
   
   
   **Risk level: none | low | medium | high**
   
   low
   
   

