wl created HUDI-4824:
------------------------

             Summary: add new index RANGE_BUCKET , when primary key is 
auto-increment like most mysql table
                 Key: HUDI-4824
                 URL: https://issues.apache.org/jira/browse/HUDI-4824
             Project: Apache Hudi
          Issue Type: New Feature
            Reporter: wl


h3. Change Logs

MySQL-style tables usually have an auto-increment primary key. Based on this 
fact and the existing Bucket index, we propose a Range_Bucket index, which has 
performed well in my practice.

The basic concept is the same as the Bucket index; the most important part is 
the mapping (integer division): bucketId = primaryKey / stepSize
For example, with stepSize = 2, the bucketId mapping looks like this:
pKey, bucketId
1, 0
2, 0
3, 1
4, 1
5, 2
6, 2
7, 3
...
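
A minimal sketch of the mapping above, not the code from the proposed change. The object name, the {{bucketId}} helper and the {{firstKey}} parameter are hypothetical. The listed table corresponds to (pKey - 1) / stepSize with integer division, i.e. keys counted from 1; the literal formula pKey / stepSize would shift each boundary by one key, so the sketch exposes that offset as an assumption.

```scala
object RangeBucketSketch {
  // Hypothetical helper: maps an auto-increment key to a bucket id.
  // firstKey = 1 reproduces the table above; firstKey = 0 gives the
  // literal formula primaryKey / stepSize.
  def bucketId(primaryKey: Long, stepSize: Long, firstKey: Long = 1L): Long = {
    require(stepSize > 0, "stepSize must be positive")
    require(primaryKey >= firstKey, "primaryKey must not be below firstKey")
    (primaryKey - firstKey) / stepSize // integer division
  }

  def main(args: Array[String]): Unit = {
    val stepSize = 2L
    println("pKey, bucketId")
    (1L to 7L).foreach { k =>
      println(s"$k, ${bucketId(k, stepSize)}")
    }
  }
}
```

Because consecutive keys land in the same bucket, each bucket covers one contiguous key range of width stepSize.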

In my practice, I synced more than 1,000 MySQL tables to Hudi with stepSize = 
1,500,000, and the base file sizes were about 50 MB to 350 MB.

As a test, setting stepSize = 2 with pKey = \{1, 4, 9} produces three base 
files:
[!https://user-images.githubusercontent.com/67826098/189126613-a50e3347-900f-4477-a4e8-f935bd996ebd.png|width=762!|https://user-images.githubusercontent.com/67826098/189126613-a50e3347-900f-4477-a4e8-f935bd996ebd.png]
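
The test can be reproduced on paper with a small sketch that groups keys by bucket id, assuming one base file per non-empty bucket as in the test. The object and helper names are hypothetical, and the literal formula primaryKey / stepSize with integer division is used.

```scala
object RangeBucketGrouping {
  // Hypothetical helper: groups keys by primaryKey / stepSize.
  def groupByBucket(keys: Seq[Long], stepSize: Long): Map[Long, Seq[Long]] = {
    require(stepSize > 0, "stepSize must be positive")
    keys.groupBy(_ / stepSize)
  }

  def main(args: Array[String]): Unit = {
    val buckets = groupByBucket(Seq(1L, 4L, 9L), 2L)
    // Print the non-empty buckets in ascending bucket id order.
    buckets.toSeq.sortBy(_._1).foreach { case (id, ks) =>
      println(s"bucket $id -> ${ks.mkString(", ")}")
    }
    println(s"non-empty buckets: ${buckets.size}")
  }
}
```

The three keys fall into three different buckets, which matches the three base files in the test.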
h3. Impact

This introduces a new index, RANGE_BUCKET, which can be used as follows:
 {{  option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
  option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
  option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
  option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
  option(HoodieLayoutConfig.LAYOUT_PARTITIONER_CLASS_NAME.key, 
classOf[SparkRangeBucketIndexPartitioner[_]].getName).}}
*Risk level: none | low | medium | high*

low



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
