[
https://issues.apache.org/jira/browse/HUDI-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jian Feng updated HUDI-5560:
----------------------------
Summary: Make Consistent hash index Bucket Resizing more available on real
cases (was: Make Consistent hash index more available on real cases )
> Make Consistent hash index Bucket Resizing more available on real cases
> ------------------------------------------------------------------------
>
> Key: HUDI-5560
> URL: https://issues.apache.org/jira/browse/HUDI-5560
> Project: Apache Hudi
> Issue Type: Improvement
> Components: index
> Reporter: Jian Feng
> Priority: Major
>
> Bucket Resizing (Splitting & Merging)
> Because bucket resizing is semantically similar to clustering (i.e.,
> re-organizing small data files), this proposal integrates the resizing
> process into the clustering service as a subtask. Resizing is triggered
> directly by file size: small files are merged and large files are split.
> For merging files, the buckets must be adjacent to each other in terms of
> their hash ranges, so that the output bucket covers a single continuous hash
> range. A standard consistent hashing algorithm does not require this, but
> fragmented hash ranges would add complexity to the splitting process in our
> case.
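The adjacency rule for merging can be illustrated with a minimal sketch. It is not Hudi code: the `Bucket` layout (half-open ranges `[start, end)`), the `small_bytes` threshold, and the function name are all assumptions made for illustration, and wrap-around at the end of the hash ring is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    start: int       # inclusive start of the hash range (hypothetical layout)
    end: int         # exclusive end of the hash range
    size_bytes: int  # size of the bucket's data file


def merge_small_adjacent(buckets, small_bytes):
    """Merge runs of neighbouring small buckets into single buckets.

    Two buckets are merged only if both are below `small_bytes` and their
    hash ranges touch, so every output bucket keeps one continuous range.
    """
    merged = []
    for b in sorted(buckets, key=lambda b: b.start):
        prev = merged[-1] if merged else None
        can_merge = (
            prev is not None
            and prev.end == b.start            # adjacent hash ranges only
            and prev.size_bytes < small_bytes
            and b.size_bytes < small_bytes
        )
        if can_merge:
            merged[-1] = Bucket(prev.start, b.end, prev.size_bytes + b.size_bytes)
        else:
            merged.append(b)
    return merged
```

With a threshold of 100 bytes, buckets `[0, 100)` and `[100, 200)` of 10 and 20 bytes would collapse into one bucket `[0, 200)`, whereas two small buckets `[0, 100)` and `[200, 300)` would stay separate because their ranges do not touch.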
> For splitting files, a split point (i.e., the hash ranges of the output
> buckets) must be decided:
> 1. A simple policy is to split at the midpoint of the hash range, but this
>    may produce uneven data files. In an extreme case, splitting may produce
>    one file with all the data and one file with none.
> 2. Another policy is to find a split point that distributes records evenly
>    across the child buckets. This requires knowledge of the hash value
>    distribution of the original bucket.
> Our implementation will start with the first, simpler policy, as buckets
> will eventually converge to a balanced distribution after multiple rounds of
> resizing. The implementation will remain pluggable so that users can choose
> among the available policies.
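The two split policies and the pluggable design can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `SplitPolicy` signature, and the half-open range `[start, end)` are hypothetical and not taken from Hudi. Record hash values are passed in as a plain list, and the range is assumed to span at least two hash values.

```python
from typing import Callable, List, Sequence, Tuple

# A policy maps (range start, range end, record hashes) to a split point.
SplitPolicy = Callable[[int, int, Sequence[int]], int]


def midpoint_split(start: int, end: int, hashes: Sequence[int]) -> int:
    """Simple policy: split in the middle of the range, ignoring the data."""
    return start + (end - start) // 2


def median_split(start: int, end: int, hashes: Sequence[int]) -> int:
    """Balanced policy: split at the median of the observed hash values."""
    if not hashes:
        return midpoint_split(start, end, hashes)
    ordered = sorted(hashes)
    point = ordered[len(ordered) // 2]
    # Clamp so that both child hash ranges stay non-empty.
    return min(max(point, start + 1), end - 1)


def split_bucket(
    start: int, end: int, hashes: Sequence[int],
    policy: SplitPolicy = midpoint_split,
) -> Tuple[Tuple[int, int, List[int]], Tuple[int, int, List[int]]]:
    """Split [start, end) into two children using the chosen policy."""
    point = policy(start, end, hashes)
    left = [h for h in hashes if h < point]
    right = [h for h in hashes if h >= point]
    return (start, point, left), (point, end, right)
```

For a bucket covering `[0, 100)` whose records hash to 60, 70, 80 and 90, `midpoint_split` picks 50 and leaves the left child empty, which is the extreme case mentioned in the description. `median_split` picks 80 and gives each child two records, at the cost of needing the hash values up front.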
> All updates to the hash metadata are first recorded in the clustering plan
> and then applied to the partitions' hashing metadata when clustering
> finishes. The plan is generated and persisted to files during scheduling,
> which is protected by a table-level lock to ensure a consistent table view.
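The two-phase handling of hash metadata (record it in the plan first, publish it on completion) might look like the toy model below. All names here are hypothetical: a `threading.Lock` stands in for the table-level lock, an in-memory list stands in for the persisted plan files, and the actual rewriting of data files is omitted.

```python
import threading
from dataclasses import dataclass, field

table_lock = threading.Lock()  # stand-in for the table-level lock


@dataclass
class ClusteringPlan:
    partition: str
    new_ranges: list  # proposed hash ranges, e.g. [(0, 50), (50, 100)]


@dataclass
class Table:
    hashing_metadata: dict = field(default_factory=dict)  # partition -> ranges
    pending_plans: list = field(default_factory=list)     # "persisted" plans


def schedule(table: Table, partition: str, new_ranges) -> ClusteringPlan:
    """Generate and persist a plan under the lock; metadata is not touched."""
    with table_lock:
        plan = ClusteringPlan(partition, list(new_ranges))
        table.pending_plans.append(plan)
    return plan


def complete(table: Table, plan: ClusteringPlan) -> None:
    """Once clustering finishes, apply the planned ranges to the metadata."""
    table.hashing_metadata[plan.partition] = plan.new_ranges
    table.pending_plans.remove(plan)
```

In this model, readers of `hashing_metadata` keep seeing the old ranges for a partition until `complete` runs, which mirrors the description's point that the metadata changes only when clustering finishes.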
--
This message was sent by Atlassian Jira
(v8.20.10#820010)