Re: [DISCUSS] Hash Index for HUDI

耿筱喻 Fri, 04 Jun 2021 03:05:49 -0700

Thank you for your questions.

For the first question, expanding the number of buckets by a multiple is
recommended. Combining rehashing with clustering re-distributes the data
without shuffling. For example, 2 buckets expand to 4 by splitting the 1st
bucket and rehashing its data into two smaller buckets: the 1st and 3rd
buckets. Details have been added to the RFC.
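
The splitting step described above can be sketched as follows. This is only a
minimal illustration, assuming the bucket id is computed as hash(key) modulo
the bucket count; zlib.crc32 stands in for whatever hash function the index
really uses, and the names bucket_id and split_bucket are hypothetical, not
taken from the RFC.

```python
import zlib


def bucket_id(key: str, num_buckets: int) -> int:
    # Assumption: bucket = hash(key) mod num_buckets.
    # crc32 is only a deterministic stand-in for the real hash function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def split_bucket(keys, old_id: int, old_num_buckets: int) -> dict:
    """Rehash the keys of one bucket after the bucket count is doubled.

    Because h % (2 * N) is either h % N or h % N + N, every key of
    bucket old_id lands in old_id or old_id + N. No other bucket is touched.
    """
    new_num_buckets = old_num_buckets * 2
    result = {old_id: [], old_id + old_num_buckets: []}
    for key in keys:
        result[bucket_id(key, new_num_buckets)].append(key)
    return result


# Usage: expand 2 buckets to 4 by splitting the 1st bucket (id 0).
all_keys = [f"key-{i}" for i in range(1000)]
first_bucket = [k for k in all_keys if bucket_id(k, 2) == 0]
halves = split_bucket(first_bucket, old_id=0, old_num_buckets=2)
# The keys end up only in ids 0 and 2, i.e. the 1st and 3rd buckets.
print(sorted(halves))
```

With a non-multiple expansion (for example 100 to 120 buckets), this property
does not hold under plain modulo hashing, so keys from one bucket could move
to many others, which is why expanding by a multiple is preferred here.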


For the second one, data skew when writing to Hudi with a hash index can be
solved by using multiple file groups per bucket, as mentioned in the RFC. For
data processing engines like Spark, data skew in table joins can be solved by
splitting the skewed partition into smaller units and distributing them to
different tasks for execution; this works in some scenarios that have a fixed
SQL pattern. Besides, a data skew solution needs more effort to be compatible
with the bucket join rule. However, the read and write long tail caused by
data skew in SQL queries is hard to solve.
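
To make the join-side idea concrete, here is a rough sketch in plain Python
(not Spark code) of splitting a skewed partition into smaller units that
could each run as a separate task. The function names and the
max_rows_per_unit parameter are hypothetical and only for illustration; a
real engine would schedule the units in parallel rather than in a loop.

```python
def split_into_units(rows, max_rows_per_unit: int):
    # Cut one oversized partition into fixed-size chunks.
    return [rows[i:i + max_rows_per_unit]
            for i in range(0, len(rows), max_rows_per_unit)]


def join_unit(left_unit, right_rows):
    # Plain hash join of one unit against the matching right-side partition.
    index = {}
    for key, value in right_rows:
        index.setdefault(key, []).append(value)
    return [(key, left_value, right_value)
            for key, left_value in left_unit
            for right_value in index.get(key, [])]


def skew_join(left_partition, right_partition, max_rows_per_unit: int):
    results = []
    for unit in split_into_units(left_partition, max_rows_per_unit):
        # Each unit is independent, so it could be handed to its own task.
        results.extend(join_unit(unit, right_partition))
    return results


# Usage: the key "hot" dominates the left side of the join.
left = [("hot", i) for i in range(10)] + [("cold", 99)]
right = [("hot", "a"), ("hot", "b"), ("cold", "c")]
joined = skew_join(left, right, max_rows_per_unit=4)
```

Note that every unit reads the whole right-side partition, so the right side
is effectively duplicated per unit; that extra read is part of why this does
not combine trivially with a bucket join.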

Regards,
Shawy

> On June 3, 2021, at 10:47, Danny Chan <[email protected]> wrote:
> 
> Thanks for the new feature, very promising ~
> 
> Some confusion about the *Scalability* and *Data Skew* part:
> 
> How do we expand the number of existing buckets? Say we had 100 buckets
> before but have 120 buckets now: what is the algorithm?
> 
> About the data skew, did you mean there is no good solution to this
> problem now?
> 
> Best,
> Danny Chan
> 
> 耿筱喻 <[email protected]> wrote on Wed, June 2, 2021 at 10:42 PM:
> 
>> Hi,
>> Currently, the Hudi index implementation is pluggable and provides two
>> options: bloom filter and HBase. When a Hudi table becomes large, the
>> performance of the bloom filter degrades drastically due to the increase
>> in false positive probability.
>> 
>> A hash index is an efficient, lightweight approach to address the
>> performance issue. Hive uses it under the name Bucket: it clusters the
>> records whose keys have the same hash value under a single hash function.
>> This pre-distribution can accelerate SQL queries in some scenarios.
>> Besides, Bucket in Hive offers efficient sampling.
>> 
>> I have written an RFC for this:
>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index.
>> 
>> Feel free to discuss in this thread; suggestions are welcome.
>> 
>> Regards,
>> Shawy
