Hi Parquet Developers

In the bloom filter design doc we have discussed and settled on the bloom
filter definition; now I'd like to invite you to discuss how to build a
bloom filter in Parquet.

In my current implementation, a bloom filter is created first according to
a specified number of distinct values (NDV) and false positive probability
(FPP), and then it is updated as the column writer writes values. This
approach requires the user to estimate the column's NDV per row group, but
that is usually hard for end users to do, especially since they don't know
the row group size. As a result, the created bloom filter may neither match
the expected FPP nor fit within the size requirements. Although I could
provide extra parameters such as a maximum bloom filter size to avoid
wasting space, I think this can still be improved.
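For context, the size/FPP trade-off comes from the standard bloom filter
sizing formula, so a wrong NDV estimate shows up directly as either wasted
space or a much higher FPP than requested. A minimal illustration (plain
Java, not Parquet code):

// Minimal illustration of how sensitive the filter size is to the NDV estimate.
public class BloomSizeEstimate {
  // standard sizing formula: bits = -ndv * ln(fpp) / (ln 2)^2
  static long optimalBits(long ndv, double fpp) {
    return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  public static void main(String[] args) {
    // if the user guesses 100k distinct values but the chunk really holds 1M,
    // the filter is sized ~10x too small and the actual FPP ends up far above 1%
    System.out.println(optimalBits(100_000L, 0.01) / 8 + " bytes for ndv=100k");
    System.out.println(optimalBits(1_000_000L, 0.01) / 8 + " bytes for ndv=1M");
  }
}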

So I think the following three things need to be discussed first.

1. What parameters/configurations should we expose to end users?
In my mind, a better way is for users to specify only the column names and
the maximum sizes they are willing to spend on the bloom filters. Parquet
would then take responsibility for calculating the NDV and creating the
bloom filter. (A rough sketch of these ideas follows question 3.)

2. How do we calculate the NDV at run time?
I tried allocating a set to store all the hash values for a column chunk
and then updating the bloom filter bitset in one pass when flushing the row
group. I'm not sure whether this could cause memory issues.

3. When should we update the bloom filter?
When writing values in the column writer, or when flushing the row group?
If we use the set to store distinct hash values, we can update the filter
when flushing the row group.
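To make questions 1-3 concrete, here is a rough sketch of what I have in
mind. The property names (parquet.bloom.filter.columns,
parquet.bloom.filter.max.bytes), the BloomFilter interface, and the
DeferredBuilder class are all placeholders for discussion, not existing
Parquet APIs:

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;

// Sketch only: the property names and the BloomFilter interface here are
// placeholders for discussion, not existing Parquet APIs.
public class BloomFilterProposalSketch {

  /** Placeholder for whatever filter implementation we agree on. */
  interface BloomFilter {
    void insertHash(long hash);
  }

  // Question 1: hypothetical user-facing knobs -- only column names and a
  // per-column-chunk size cap, no NDV estimate required from the user.
  static Set<String> bloomFilterColumns(Configuration conf) {
    Set<String> columns = new HashSet<>();
    for (String c : conf.getTrimmedStrings("parquet.bloom.filter.columns")) {
      columns.add(c);
    }
    return columns;
  }

  static int maxBloomFilterBytes(Configuration conf) {
    return conf.getInt("parquet.bloom.filter.max.bytes", 1 << 20);
  }

  // Questions 2 and 3: keep only the 64-bit hash of each value while the
  // column writer runs, then size and fill the filter at row group flush.
  static class DeferredBuilder {
    private final Set<Long> hashes = new HashSet<>();

    /** Called by the column writer for every value; only the hash is kept. */
    void add(long valueHash) {
      hashes.add(valueHash);
    }

    /** Called when the row group is flushed; NDV is simply the set size. */
    BloomFilter flush(double fpp, int maxBytes) {
      BloomFilter filter = createFilter(hashes.size(), fpp, maxBytes);
      for (long h : hashes) {
        filter.insertHash(h);
      }
      hashes.clear(); // release the per-chunk memory once flushed
      return filter;
    }
  }

  /** Stand-in factory: sizes from the observed NDV, capped at the user's limit. */
  static BloomFilter createFilter(int ndv, double fpp, int maxBytes) {
    long bits = (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    long bytes = Math.min((bits + 7) / 8, maxBytes);
    System.out.println("allocating " + bytes + " bytes for ndv=" + ndv);
    // a real implementation would allocate a block-split bitset of `bytes` bytes
    return hash -> { };
  }
}

Storing only 64-bit hashes rather than the values bounds the extra memory
to roughly 8 bytes (plus set overhead) per distinct value per column chunk,
but it still grows with the NDV, which is exactly the memory concern raised
in question 2.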

There are probably more things to consider beyond these three. I'd really
appreciate any opinions, or anything else you think should be raised.

-- 
Thanks & Best Regards
