I see that we already have a PR up. Will catch up on it and provide some initial comments. Thanks!
On Wed, Jun 16, 2021 at 9:02 AM Shawy Geng <gengxiaoyu1...@gmail.com> wrote:

> Combining bucket index and bloom filter is a great idea. There is no
> conflict between the two in implementation, and the bloom filter info can
> still be stored in the file to position records faster.
>
> Best,
> Shawy
>
> > On Jun 9, 2021, at 16:23, Thiru Malai <thiru.dr...@gmail.com> wrote:
> >
> > Hi,
> >
> > This feature seems promising. If we are planning to assign the
> > fileGroupId as the hash mod value, then we can leverage this change in
> > the Bloom Index as well by pruning the files based on the hash mod
> > value before the min/max record_key pruning. The exploded RDD would
> > then be comparatively smaller, which would eventually reduce the
> > shuffle size in the "Compute all comparisons needed between records
> > and files" stages.
> >
> > Can we add this hash-based indexing approach to the Bloom Filter based
> > approach as well?
> >
> > On 2021/06/07 03:26:34, Danny Chan <danny0...@apache.org> wrote:
> >>> number of buckets expanded by multiple is recommended
> >> The condition is too harsh and the bucket number would grow
> >> exponentially.
> >>
> >>> with hash index can be solved by using multiple file groups per
> >>> bucket as mentioned in the RFC
> >> The relation of file groups and buckets would be too complicated; we
> >> should avoid that. It also requires that the query engine be aware of
> >> the bucketing rules, which is not transparent and is not a common
> >> query optimization.
> >>
> >> Best,
> >> Danny Chan
> >>
> >> 耿筱喻 <gengxiaoyu1...@gmail.com> wrote on Fri, Jun 4, 2021 at 6:06 PM:
> >>
> >>> Thank you for your questions.
> >>>
> >>> For the first question, expanding the number of buckets by a
> >>> multiple is recommended. Combine rehashing and clustering to
> >>> re-distribute the data without shuffling. For example, 2 buckets
> >>> expand to 4 by splitting the 1st bucket and rehashing the data in it
> >>> into two smaller buckets: the 1st and 3rd buckets. Details have been
> >>> supplied to the RFC.
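The split-and-rehash scheme described above can be sketched as follows. This is a minimal illustration, not Hudi code; the function names are hypothetical, and bucket numbering is 0-based here, so "bucket 0 splits into buckets 0 and 2" corresponds to the 1st and 3rd buckets mentioned in the email:

```python
# Sketch: when the bucket count doubles from n to 2n, a record in bucket
# i = hash(key) % n lands in either bucket i or bucket i + n under the
# new modulus. Each old bucket therefore splits cleanly into exactly two
# new buckets, and no record ever crosses between unrelated buckets, so
# no global shuffle is needed.

def old_bucket(key: str, n: int) -> int:
    return hash(key) % n

def new_bucket(key: str, n: int) -> int:
    # bucket count has doubled to 2 * n
    return hash(key) % (2 * n)

keys = [f"key-{i}" for i in range(1000)]
n = 2
for k in keys:
    old = old_bucket(k, n)
    new = new_bucket(k, n)
    # every record either stays in its old bucket or moves to old + n
    assert new in (old, old + n)
```

The same argument applies to any expansion by a whole multiple m (new bucket for a record is old + j * n for some j < m), which is presumably why expansion "by a multiple" is the recommended condition above.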
> >>>
> >>> For the second one, data skew when writing to Hudi with a hash index
> >>> can be solved by using multiple file groups per bucket, as mentioned
> >>> in the RFC. For a data processing engine like Spark, data skew
> >>> during table joins can be solved by splitting the skewed partition
> >>> into smaller units and distributing them to different tasks to
> >>> execute, and this works in scenarios with a fixed SQL pattern.
> >>> Besides, a data skew solution needs more effort to be compatible
> >>> with the bucket join rule. However, the read and write long tail
> >>> caused by data skew in SQL queries is hard to solve.
> >>>
> >>> Regards,
> >>> Shawy
> >>>
> >>>> On Jun 3, 2021, at 10:47, Danny Chan <danny0...@apache.org> wrote:
> >>>>
> >>>> Thanks for the new feature, very promising ~
> >>>>
> >>>> Some confusion about the *Scalability* and *Data Skew* parts:
> >>>>
> >>>> How do we expand the number of existing buckets? Say we have 100
> >>>> buckets before but 120 buckets now, what is the algorithm?
> >>>>
> >>>> About the data skew, did you mean there is no good solution to this
> >>>> problem now?
> >>>>
> >>>> Best,
> >>>> Danny Chan
> >>>>
> >>>> 耿筱喻 <gengxiaoyu1...@gmail.com> wrote on Wed, Jun 2, 2021 at
> >>>> 10:42 PM:
> >>>>
> >>>>> Hi,
> >>>>> Currently, the Hudi index implementation is pluggable and provides
> >>>>> two options: bloom filter and HBase. When a Hudi table becomes
> >>>>> large, the performance of the bloom filter degrades drastically
> >>>>> due to the increase in the false positive probability.
> >>>>>
> >>>>> Hash index is an efficient, light-weight approach to address this
> >>>>> performance issue. It is used in Hive, where it is called Bucket:
> >>>>> it clusters the records whose keys have the same hash value under
> >>>>> a unique hash function. This pre-distribution can accelerate SQL
> >>>>> queries in some scenarios. Besides, Bucket in Hive offers
> >>>>> efficient sampling.
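The core idea of the hash index, and the pruning optimization suggested earlier in the thread, can be sketched as follows. This is an illustrative toy, not Hudi code: the file-group naming, the fixed bucket count, and the one-file-group-per-bucket layout are all assumptions made for the example.

```python
# Sketch: a hash/bucket index maps each record key to a fixed bucket,
# and hence a fixed file group. On the lookup path, that same mapping
# lets us discard every other file group before doing any bloom-filter
# check or min/max record-key comparison.

NUM_BUCKETS = 8  # assumed fixed per table/partition

def bucket_of(record_key: str) -> int:
    return hash(record_key) % NUM_BUCKETS

# Hypothetical file listing: one file group per bucket, named by bucket id.
files = {b: f"filegroup-{b:04d}" for b in range(NUM_BUCKETS)}

def candidate_files(record_key: str) -> list:
    # Only one bucket's file group can possibly contain the key, so the
    # other NUM_BUCKETS - 1 file groups are pruned without being read.
    return [files[bucket_of(record_key)]]
```

Under this layout a lookup touches exactly one candidate file group, which is why combining it with a per-file bloom filter (as suggested at the top of the thread) shrinks the "records x files" comparison set before the shuffle.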
> >>>>>
> >>>>> I made an RFC for this:
> >>>>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>>>>
> >>>>> Feel free to discuss under this thread; suggestions are welcome.
> >>>>>
> >>>>> Regards,
> >>>>> Shawy