Thanks for raising this, Junjie.

One more topic worth adding:
Which columns do we want to write bloom filters for? Might it depend on the
type? Is a bloom filter required if we have a dictionary? Is a bloom filter
required if the column is ordered and we have column indexes? (etc.)



On Thu, Jan 17, 2019 at 2:56 PM 俊杰陈 <cjjnj...@gmail.com> wrote:

> Hi Parquet Developers
>
> In the bloom filter design doc we have discussed and settled on the bloom
> filter definition. Now I'd like to invite you to discuss how to build a
> bloom filter in Parquet.
>
> In my current implementation, a bloom filter is first created according to
> a specified number of distinct values (NDV) and false positive probability
> (FPP), and is then updated as the column writer writes values. This approach
> requires the user to estimate the column's NDV in a row group. However, it
> is usually hard for end users to get this information, especially since
> they don't know the row group size. As a result, the created bloom filter
> neither matches the expected FPP nor fits the size requirements. Though I
> could provide extra parameters such as a max bloom filter size to avoid
> wasting space, I think it can still be improved.
>
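
To make the NDV/FPP sizing problem concrete, here is a minimal sketch using
the textbook bloom filter formulas. This is an assumption for illustration
only: Parquet's block-based filter behaves somewhat differently, and the
function names below are hypothetical, not parquet-mr APIs.

```python
import math

def optimal_num_bits(ndv: int, fpp: float) -> int:
    # Classic sizing: m = -n * ln(p) / (ln 2)^2
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))

def optimal_num_hashes(num_bits: int, ndv: int) -> int:
    # Classic choice: k = (m / n) * ln 2, at least one hash function
    return max(1, round(num_bits / ndv * math.log(2)))

def expected_fpp(num_bits: int, num_hashes: int, ndv: int) -> float:
    # Approximate FPP after inserting ndv distinct values:
    # (1 - e^(-k * n / m))^k
    return (1.0 - math.exp(-num_hashes * ndv / num_bits)) ** num_hashes

if __name__ == "__main__":
    # The user guesses 1,000 distinct values at 1% FPP ...
    bits = optimal_num_bits(1_000, 0.01)
    k = optimal_num_hashes(bits, 1_000)
    # ... but the row group may actually hold far more distinct values.
    for actual_ndv in (1_000, 2_000, 10_000):
        print(actual_ndv, bits, k, expected_fpp(bits, k, actual_ndv))
```

Under these formulas, a filter sized for the guessed NDV drifts far from the
requested FPP once the real NDV is several times larger, which is the
mismatch described above.
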
> So I think the following three things need to be discussed first.
>
> 1. What parameters/configurations should we present to the end user?
> In my mind, a better way is for users to specify the column names and the
> max sizes they are willing to spend on the bloom filters. Parquet then takes
> responsibility for calculating the NDV and creating the bloom filter.
>
> 2. How to calculate the NDV at run time?
> I tried allocating a set to store all hash values for a column chunk and
> then updating the bloom filter bitset all at once when flushing the row
> group. I'm not sure whether this could cause memory issues.
>
> 3. When to update the bloom filter?
> When writing values in the column writer, or when flushing the row group? If
> we use the set to store distinct hash values, we can update when flushing
> the row group.
>
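
The three points above (user gives only a max size, NDV is counted from a
set of hashes, filter is built at flush) could be sketched roughly as
follows. Everything here is hypothetical: `DeferredBloomBuilder`, `hash64`
and `might_contain` are illustrative names, BLAKE2b stands in for whatever
hash the spec settles on, and the bit layout is a plain bloom filter rather
than Parquet's block-based one.

```python
import hashlib
import math

def hash64(value: bytes) -> int:
    # Stand-in 64-bit hash (assumption; not the hash Parquet specifies).
    return int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "little")

def _positions(h: int, num_bits: int, num_hashes: int):
    # Double hashing: derive all bit positions from one 64-bit hash.
    h1 = h & 0xFFFFFFFF
    h2 = (h >> 32) | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

class DeferredBloomBuilder:
    """Buffers distinct hashes per column chunk; builds the filter at flush."""

    def __init__(self, fpp: float, max_bytes: int):
        self.fpp = fpp
        self.max_bytes = max_bytes  # user-facing cap, must be >= 1
        self.hashes = set()         # grows linearly with the chunk's NDV

    def add(self, value: bytes) -> None:
        self.hashes.add(hash64(value))

    def flush(self):
        # The NDV is now known exactly (up to hash collisions).
        ndv = max(1, len(self.hashes))
        want_bits = math.ceil(-ndv * math.log(self.fpp) / math.log(2) ** 2)
        num_bytes = min(max(1, math.ceil(want_bits / 8)), self.max_bytes)
        num_bits = num_bytes * 8
        num_hashes = max(1, round(num_bits / ndv * math.log(2)))
        bits = bytearray(num_bytes)
        for h in self.hashes:
            for pos in _positions(h, num_bits, num_hashes):
                bits[pos >> 3] |= 1 << (pos & 7)
        self.hashes.clear()  # release the buffer for the next row group
        return bits, num_hashes

def might_contain(bits: bytearray, num_hashes: int, value: bytes) -> bool:
    num_bits = len(bits) * 8
    return all(bits[p >> 3] & (1 << (p & 7))
               for p in _positions(hash64(value), num_bits, num_hashes))
```

The memory concern is visible in `self.hashes`: the buffer holds one 64-bit
hash per distinct value until the flush, so a high-cardinality column chunk
pays for it in writer memory. When the cap is hit, the filter still has no
false negatives, but the FPP will be worse than requested.
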
> There are probably more things to consider beyond the above three. I would
> really appreciate any opinions, or anything else you think needs to be
> raised.
>
> --
> Thanks & Best Regards
>
