[jira] [Updated] (HUDI-4824) add new index RANGE_BUCKET , when primary key is auto-increment like most mysql table

wl (Jira) Fri, 09 Sep 2022 10:21:07 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


wl updated HUDI-4824:
---------------------
    Status: In Progress  (was: Open)

> add new index RANGE_BUCKET , when primary key is auto-increment like most 
> mysql table
> -------------------------------------------------------------------------------------
>
>                 Key: HUDI-4824
>                 URL: https://issues.apache.org/jira/browse/HUDI-4824
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: wl
>            Priority: Major
>
> h3. Change Logs
> Usually, a mysql... table has an auto-increment primary key. Based on this 
> fact and the Bucket index, we propose a RANGE_BUCKET index, which has 
> performed well in my practice.
> The basic concept is the same as the Bucket index, and the most important 
> rule is: bucketId = primaryKey / stepSize
> For example, with stepSize = 2, the bucketId mapping looks like this:
> pKey, bucketId
> 1, 0
> 2, 0
> 3, 1
> 4, 1
> 5, 2
> 6, 2
> 7, 3
> ...
> In my practice, I sync 1000+ MySQL tables to Hudi with stepSize = 
> 1,500,000, and base file sizes are about 50 MB to 350 MB.
> As a test, setting stepSize = 2 and writing pKey = \{1, 4, 9} produces 
> three base files:
> [!https://user-images.githubusercontent.com/67826098/189126613-a50e3347-900f-4477-a4e8-f935bd996ebd.png|width=762!|https://user-images.githubusercontent.com/67826098/189126613-a50e3347-900f-4477-a4e8-f935bd996ebd.png]
> h3. Impact
> Introduces a new index, RANGE_BUCKET, which can be used as follows:
>  {{  option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
>   option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
>   option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
>   option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
>   option(HoodieLayoutConfig.LAYOUT_PARTITIONER_CLASS_NAME.key, 
> classOf[SparkRangeBucketIndexPartitioner[_]].getName).}}
> *Risk level: none | low | medium | high*
> low
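
A minimal Scala sketch of the key-to-bucket mapping described in the change log, for readers who want to check the arithmetic. It is not taken from the patch: the object and function names (`RangeBucketSketch`, `bucketId`, `bucketIdOneBased`, `distinctBuckets`) are hypothetical. Note that the formula as written, primaryKey / stepSize, puts key 2 in bucket 1 under integer division, while the mapping table puts keys 1 and 2 together in bucket 0, which matches (primaryKey - 1) / stepSize. The sketch shows both, and treating the second as the intended behavior is only an assumption.

```scala
object RangeBucketSketch {

  // Formula exactly as written in the issue text, using integer division.
  def bucketId(primaryKey: Long, stepSize: Long): Long = {
    require(stepSize > 0, "stepSize must be positive")
    require(primaryKey >= 0, "primaryKey must be non-negative")
    primaryKey / stepSize
  }

  // Assumed variant (not stated in the issue): keys start at 1, so shifting
  // by one reproduces the mapping table (1,2 -> 0; 3,4 -> 1; ...).
  def bucketIdOneBased(primaryKey: Long, stepSize: Long): Long = {
    require(stepSize > 0, "stepSize must be positive")
    require(primaryKey >= 1, "primaryKey must be at least 1")
    (primaryKey - 1) / stepSize
  }

  // Distinct buckets touched by a set of keys, using the one-based variant.
  // Each distinct bucket corresponds to one base file in the issue's test.
  def distinctBuckets(keys: Seq[Long], stepSize: Long): Seq[Long] =
    keys.map(bucketIdOneBased(_, stepSize)).distinct.sorted

  def main(args: Array[String]): Unit = {
    // Print the mapping for keys 1..7 with stepSize = 2.
    (1L to 7L).foreach(k => println(s"$k, ${bucketIdOneBased(k, 2L)}"))
    // Number of buckets touched by pKey = {1, 4, 9} with stepSize = 2.
    println(distinctBuckets(Seq(1L, 4L, 9L), 2L).size)
  }
}
```

With stepSize = 2 and pKey = {1, 4, 9}, both variants touch three distinct buckets, which agrees with the three base files reported in the test. The variants differ only at bucket boundaries: with stepSize = 1,500,000, key 3,000,000 falls in bucket 2 under the literal formula and in bucket 1 under the one-based variant.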



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
