Re: [DISCUSS] Hash Index for HUDI

Satish Kotha Wed, 02 Jun 2021 10:51:42 -0700

+1. You may want to read this thread
<http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/%3CCADNkrp7MHbNH_s2Svyo%2B56xJu-v7knzgE9sed8MrWAXQC3LQCw%40mail.gmail.com%3E>
as well. There are minor differences between the two threads, but the
high-level idea is similar.


On Wed, Jun 2, 2021 at 7:42 AM 耿筱喻 <gengxiaoyu1...@gmail.com> wrote:

> Hi,
> Currently, Hudi's index implementation is pluggable and provides two
> options: bloom filter and HBase. When a Hudi table becomes large, the
> performance of the bloom filter degrades drastically due to the increase
> in false-positive probability.
>
> A hash index is an efficient, lightweight approach to address this
> performance issue. Hive uses the same idea under the name bucketing:
> records whose keys have the same hash value under a single hash function
> are clustered together. This pre-distribution can accelerate SQL queries
> in some scenarios. Besides, bucketing in Hive offers efficient sampling.
>
> I have written an RFC for this:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index.
>
> Feel free to discuss in this thread; suggestions are welcome.
>
> Regards,
> Shawy
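
To make the bucketing idea in the quoted proposal concrete, here is a minimal sketch of mapping record keys to buckets with a single stable hash function. It is only an illustration, not the design in the RFC and not Hudi or Hive code: the names bucket_id and assign_buckets, the choice of CRC32, and the bucket count are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket number (hypothetical helper).

    CRC32 is used here only because it is stable across processes;
    the actual hash function is an assumption, not taken from the RFC.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def assign_buckets(record_keys, num_buckets: int):
    """Group record keys by bucket, so that all records sharing a
    hash value end up clustered in the same bucket."""
    buckets = defaultdict(list)
    for key in record_keys:
        buckets[bucket_id(key, num_buckets)].append(key)
    return dict(buckets)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10)]
    for bucket, members in sorted(assign_buckets(keys, 4).items()):
        print(bucket, members)
```

Because the bucket follows directly from the key, locating a record for an update is a hash computation rather than a probe of per-file bloom filters, so there are no false positives to resolve.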
