wqwl611 opened a new pull request, #6858:
URL: https://github.com/apache/hudi/pull/6858
### Change Logs
The rangeBucket index is mainly intended for syncing mysql tables to hudi
in near real time. It avoids the main drawback of simpleBucket, namely its
fixed number of buckets.
Usually, a mysql table has an auto-increment id primary key field. In the
mysql cdc synchronization scenario, we can use the database name and table
name as the partition field of hudi, and id as the primary key field of the
hudi table. This also handles sharded databases and sharded tables.
To get better sync performance, we usually use a bucket index. With the
simple bucket index, however, the number of buckets is fixed, so it is
difficult to choose a suitable number up front, and as the table grows, the
number chosen earlier will no longer be appropriate.
So I propose rangeBucket. In the simpleBucket index, the bucket number is
(hash % bucketNum); in rangeBucket, we use (id / fixedStep) to determine the
bucket number, so that as the id grows, the number of buckets also increases.
For example, if step = 10 is set, then, because the id is auto-incrementing,
a new bucket is created for every 10 records.
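The two bucket-number rules described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class and method names are hypothetical, and `Long.hashCode` stands in for whatever hash function the simple bucket index actually uses.

```java
public class BucketIdSketch {

    // Simple bucket: the bucket count is fixed, and the bucket is chosen by hash modulo.
    // Long.hashCode is an assumed stand-in for the real hash function.
    static int simpleBucketId(long id, int bucketNum) {
        // Mask the sign bit so the result is never negative.
        return (Long.hashCode(id) & Integer.MAX_VALUE) % bucketNum;
    }

    // Range bucket: integer division by the step, so the bucket number grows with the id.
    static long rangeBucketId(long id, long stepSize) {
        return id / stepSize;
    }

    public static void main(String[] args) {
        long step = 10;      // step size from the example above
        int bucketNum = 4;   // hypothetical fixed bucket count for comparison
        for (long id : new long[] {1, 9, 10, 25, 1_000_000}) {
            System.out.println("id=" + id
                + " simpleBucket=" + simpleBucketId(id, bucketNum)
                + " rangeBucket=" + rangeBucketId(id, step));
        }
    }
}
```

With the simple rule the result always stays below bucketNum, while with the range rule new bucket numbers keep appearing as larger ids arrive.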
In my actual scenario, I set step=1,000,000. Since the records of a mysql
table are usually similar in size, each bucket ends up at approximately
50M ~ 350M, which avoids the disadvantage of the fixed number of buckets in
simpleBucket.
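As a back-of-the-envelope check on the step size, the bucket size can be estimated as step × average record size. The sketch below is only an assumption-laden illustration: the helper name is hypothetical, the per-record sizes of 50 and 350 bytes are made-up values chosen to match the range quoted above, and the estimate ignores gaps in the id sequence, compression and file format overhead.

```java
public class BucketSizeEstimate {

    // Rough estimate: every id in the step range holds one record of the given average size.
    static long estimateBucketBytes(long stepSize, long avgRecordBytes) {
        // multiplyExact throws on overflow instead of silently wrapping.
        return Math.multiplyExact(stepSize, avgRecordBytes);
    }

    public static void main(String[] args) {
        long step = 1_000_000L; // step size used in the scenario above
        // Hypothetical average record sizes in bytes.
        for (long recordBytes : new long[] {50L, 350L}) {
            long bytes = estimateBucketBytes(step, recordBytes);
            System.out.println(recordBytes + " bytes/record -> about "
                + (bytes / 1_000_000L) + "M per bucket");
        }
    }
}
```

Picking a step for a different table then comes down to dividing the desired bucket size by that table's typical record size.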
### Impact
Introduces a new bucket index engine, RANGE_BUCKET, which can be used as follows:
option(HoodieIndexConfig.INDEX_TYPE.key, IndexType.BUCKET.name()).
option(HoodieIndexConfig.BUCKET_INDEX_ENGINE_TYPE.key, "RANGE_BUCKET").
option(HoodieIndexConfig.RANGE_BUCKET_STEP_SIZE.key, 2).
option(HoodieLayoutConfig.LAYOUT_TYPE.key, "BUCKET").
**Risk level: none | low | medium | high**
low