wqwl611 opened a new pull request, #6636:
URL: https://github.com/apache/hudi/pull/6636
### Change Logs
Tables in MySQL and similar databases usually have an auto-increment primary
key. Based on this fact and the Bucket index, we propose a Range_Bucket index,
which has performed well in my practice.
The basic concept is the same as the Bucket index. The most important part is:
bucketId = primaryKey / stepSize
For example, with stepSize = 2, the bucketId mapping looks like this:
pKey, bucketId
1, 0
2, 0
3, 1
4, 1
5, 2
6, 2
7, 3
...
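The mapping above can be reproduced with a few lines of Scala. This is only a sketch, not the code from this PR, and the object and method names are hypothetical. Plain integer division (`primaryKey / stepSize`) would put key 2 in bucket 1, so the sketch also includes a one-based variant, `(primaryKey - 1) / stepSize`. That variant is an assumption; it is the one that reproduces the table rows exactly.

```scala
// Hypothetical helper, not part of Hudi: illustrates the range-bucket arithmetic only.
object RangeBucketSketch {

  // Formula as written in the description, using integer (truncating) division.
  def bucketId(primaryKey: Long, stepSize: Long): Long = {
    require(stepSize > 0, "stepSize must be positive")
    primaryKey / stepSize
  }

  // Assumed one-based variant: matches the example table (keys 1 and 2 -> bucket 0).
  def bucketIdOneBased(primaryKey: Long, stepSize: Long): Long = {
    require(stepSize > 0, "stepSize must be positive")
    (primaryKey - 1) / stepSize
  }

  def main(args: Array[String]): Unit = {
    val stepSize = 2L
    println("pKey, bucketId (as written), bucketId (one-based)")
    (1L to 7L).foreach { k =>
      println(s"$k, ${bucketId(k, stepSize)}, ${bucketIdOneBased(k, stepSize)}")
    }
  }
}
```

Either way, each bucket covers a contiguous range of `stepSize` keys, so rows with nearby auto-increment ids land in the same bucket.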
In my practice, I sync 1000+ MySQL tables to Hudi with stepSize =
1,500,000, and the base file size is about 50 MB to 350 MB.
A test like this, with stepSize = 2 and pKey = {1, 4, 9}, produces three
base files:
<img width="762" alt="image"
src="https://user-images.githubusercontent.com/67826098/189126613-a50e3347-900f-4477-a4e8-f935bd996ebd.png">
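As a quick arithmetic check of this test (a sketch with hypothetical names, not the PR's test code), grouping the three keys by `primaryKey / stepSize` yields three distinct bucket ids. That is consistent with the three base files in the screenshot, assuming one base file per bucket.

```scala
// Hypothetical helper, not part of Hudi: groups primary keys by their range bucket.
object RangeBucketGroupingSketch {

  // Formula as written in the description, using integer (truncating) division.
  def bucketId(primaryKey: Long, stepSize: Long): Long = primaryKey / stepSize

  // Collect the keys that fall into each bucket.
  def groupByBucket(keys: Seq[Long], stepSize: Long): Map[Long, Seq[Long]] =
    keys.groupBy(k => bucketId(k, stepSize))

  def main(args: Array[String]): Unit = {
    val buckets = groupByBucket(Seq(1L, 4L, 9L), stepSize = 2L)
    // Print buckets in ascending id order.
    buckets.toSeq.sortBy(_._1).foreach { case (id, ks) =>
      println(s"bucket $id -> keys ${ks.mkString(", ")}")
    }
    println(s"distinct buckets: ${buckets.size}")
  }
}
```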
### Impact
Introduces a new index, RANGE_BUCKET. It can be used as follows:
option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
option(HoodieLayoutConfig.LAYOUT_PARTITIONER_CLASS_NAME.key,
classOf[SparkRangeBucketIndexPartitioner[_]].getName).
**Risk level: none | low | medium | high**
low