[
https://issues.apache.org/jira/browse/HUDI-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jian Feng updated HUDI-5560:
----------------------------
Description:
Let's take a look at the [Consistent Hash Index RFC: Bucket Resizing|https://github.com/apache/hudi/blob/master/rfc/rfc-42/rfc-42.md#bucket-resizing-splitting--merging].
I copy the bucket resizing part below:
{panel:title=*Bucket Resizing (Splitting & Merging)*}
Considering there is a semantic similarity between bucket resizing and
clustering (i.e., re-organizing small data files), this proposal plans to
integrate the resizing process as a subtask into the clustering service. The
trigger condition for resizing directly depends on the file size, where small
files will be merged and large files will be split.
For merging files, we require that the buckets should be adjacent to each other
in terms of their hash ranges so that the output bucket has only one continuous
hash range. Though it is not required in a standard Consistent Hashing
algorithm, fragmentations in hash ranges may cause extra complexity for the
splitting process in our case.
For splitting files, a split point (i.e., hash ranges for the output buckets)
should be decided:
* A simple policy would be split in the range middle, but it may produce uneven
data files. In an extreme case, splitting may produce one file with all data
and one file with no data.
* Another policy is to find a split point that evenly dispatches records into
children buckets. It requires knowledge about the hash value distribution of
the original buckets.
*In our implementation, we will first stick to the first simple one, as buckets
will finally converge to a balanced distribution after multiple rounds of
resizing processes. Of course, a pluggable implementation will be kept for
extensibility so that users can choose different available policies.*
All updates related to the hash metadata will be first recorded in the
clustering plan, and then be reflected in partitions' hashing metadata when
clustering finishes. The plan is generated and persisted in files during the
scheduling process, which is protected by a table-level lock for a consistent
table view.
{panel}
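For context, the adjacency requirement for merging can be sketched as follows: two buckets may be merged only if their hash ranges touch, so the merged bucket covers one continuous range. This is a minimal illustration under assumed names (`Bucket`, `canMerge`, a 16-bit hash space), not Hudi's actual implementation.

```java
// Hypothetical sketch of the adjacency rule for bucket merging.
// All names here are illustrative, not Hudi's actual API.
public class BucketMergeSketch {

    // A bucket owns the half-open hash range [startHash, endHash).
    record Bucket(int startHash, int endHash) {}

    // Two buckets are mergeable only when their ranges are adjacent,
    // so the output bucket has exactly one continuous hash range.
    static boolean canMerge(Bucket a, Bucket b) {
        return a.endHash() == b.startHash() || b.endHash() == a.startHash();
    }

    public static void main(String[] args) {
        Bucket b0 = new Bucket(0, 16384);
        Bucket b1 = new Bucket(16384, 32768);
        Bucket b2 = new Bucket(49152, 65536);
        System.out.println(canMerge(b0, b1)); // adjacent ranges: true
        System.out.println(canMerge(b0, b2)); // gap between ranges: false
    }
}
```

A fragmented (non-contiguous) merged range would force the later split step to track multiple sub-ranges per bucket, which is the extra complexity the RFC wants to avoid.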
As described, I also checked the code in the master branch: it uses the first policy, which produces uneven data files at first, on the assumption that buckets will eventually converge to a balanced distribution after multiple rounds of resizing. However, when I used this policy in a production environment, I found it causes OOM issues very often, since compaction cannot compact very big files with a huge number of record keys. Users also cannot read such a MergeOnRead table with uneven data files on Spark or Presto (currently the consistent hash index cannot be used on COW tables).
Is there any progress on the second policy?
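To make the difference between the two policies concrete, here is a minimal sketch under assumed names (`midRangeSplit`, `medianSplit`); it is not Hudi's code. The first policy bisects the hash range; the second picks the median of the (sampled) record hash values, so each child bucket receives roughly half of the records even when the data is skewed.

```java
import java.util.Arrays;

// Hypothetical sketch of the two split-point policies from RFC-42.
// All names are illustrative, not Hudi's actual implementation.
public class SplitPolicySketch {

    // Policy 1: split in the middle of the hash range [start, end).
    // Simple, but yields uneven children when record hashes are skewed.
    static int midRangeSplit(int start, int end) {
        return start + (end - start) / 2;
    }

    // Policy 2: use the median of the sampled record hash values,
    // so roughly half the records land in each child bucket.
    static int medianSplit(int[] recordHashes) {
        int[] sorted = recordHashes.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // Skewed data: most records hash near the low end of [0, 65536).
        int[] hashes = {10, 20, 30, 40, 50, 60, 70, 60000};
        System.out.println(midRangeSplit(0, 65536)); // 32768: 7 records left, 1 right
        System.out.println(medianSplit(hashes));     // 50: 4 records per child
    }
}
```

The second policy needs per-bucket knowledge of the hash distribution (e.g. a sample or histogram collected during write or clustering), which is the extra cost the RFC defers.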
> Make Consistent hash index Bucket Resizing more available on real cases
> ------------------------------------------------------------------------
>
> Key: HUDI-5560
> URL: https://issues.apache.org/jira/browse/HUDI-5560
> Project: Apache Hudi
> Issue Type: Improvement
> Components: index
> Reporter: Jian Feng
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)