Thanks for the new feature, very promising! I have some confusion about the *Scalability* and *Data Skew* parts:
How do we expand the number of existing buckets? Say we had 100 buckets before but need 120 buckets now; what is the algorithm?

About the data skew: do you mean there is no good solution to this problem yet?

Best,
Danny Chan

耿筱喻 <gengxiaoyu1...@gmail.com> wrote on Wed, Jun 2, 2021 at 10:42 PM:

> Hi,
>
> Currently, Hudi's index implementation is pluggable and provides two
> options: bloom filter and HBase. When a Hudi table becomes large, the
> performance of the bloom filter degrades drastically due to the increase
> in false positive probability.
>
> A hash index is an efficient, lightweight approach to address this
> performance issue. Hive uses it in the form of buckets, which cluster
> the records whose keys have the same hash value under a given hash
> function. This pre-distribution can accelerate SQL queries in some
> scenarios. Besides, bucketing in Hive enables efficient sampling.
>
> I have written an RFC for this:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index.
>
> Feel free to discuss under this thread; suggestions are welcome.
>
> Regards,
> Shawy
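To make the scalability question concrete, here is a minimal sketch (not Hudi's actual implementation; `bucket_of` and the plain modulo scheme are assumptions for illustration) showing why naively changing the bucket count from 100 to 120 would remap most existing records to different buckets, which is why a resizing algorithm matters:

```python
# Hedged sketch: assumes a simple hash-modulo bucketing scheme, not
# Hudi's real hash index. A stable hash (md5) is used so results are
# reproducible across runs.
import hashlib

def bucket_of(record_key: str, num_buckets: int) -> int:
    """Hypothetical helper: assign a record key to a bucket by hash modulo."""
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Count how many records would land in a different bucket after a
# naive resize from 100 to 120 buckets.
keys = [f"key-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if bucket_of(k, 100) != bucket_of(k, 120))
print(f"{moved / len(keys):.0%} of records change buckets going 100 -> 120")
```

Under plain modulo hashing, a key keeps its bucket only when its hash gives the same residue mod 100 and mod 120, so the large majority of records move; avoiding that full reshuffle (for example via consistent hashing or bucket splitting) is exactly the algorithmic question above.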