Hi,

I like the idea of specifying the maximum acceptable size of the bloom
filter bit vector. I think it would be much better than specifying the
expected number of distinct values (which, in my opinion, we cannot
expect the API consumer to know). The desired false positive
probability could still be specified, and as a last step before writing
the bloom filter to the file it could be checked, dropping the filter
if it does not fulfill the check.
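
For reference, the expected FPP of a standard bloom filter with m bits,
k hash functions and n inserted distinct values is approximately
(1 - e^(-k*n/m))^k, so the final check could look roughly like the
sketch below (hypothetical names, not actual parquet-mr API):

    // Sketch: decide whether a finished bloom filter meets the
    // requested FPP. m = numBits, k = numHashFunctions, n = ndv.
    static boolean fulfillsFpp(long numBits, int numHashFunctions,
                               long ndv, double maxFpp) {
      double expectedFpp = Math.pow(
          1.0 - Math.exp(-(double) numHashFunctions * ndv / numBits),
          numHashFunctions);
      return expectedFpp <= maxFpp; // otherwise drop the filter
    }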

A bit of brainstorming (just some ideas that may or may not be useful):
One more thing to consider is whether some smart encoding of the bit
vector would help save space. I expect the entropy of a nearly empty or
nearly full bloom filter to be relatively low, because it consists
mostly of zeroes or ones, respectively. For example, RLE encoding could
take advantage of such a pattern (but it should only be used when it
actually saves space, because under regular circumstances the entropy
will be high and RLE would only increase the data size).
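
To illustrate, a trivial byte-level RLE over the bitset could look like
this (just a sketch of the idea, not an encoding Parquet currently
defines; the writer would keep the encoded form only if it turns out
smaller than the raw bitset):

    import java.io.ByteArrayOutputStream;

    // Sketch: encode runs of identical bytes as (runLength, value)
    // pairs. Nearly empty/full bitsets compress well; high-entropy
    // bitsets roughly double in size, hence the fallback to raw.
    static byte[] rleEncode(byte[] bitset) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      int i = 0;
      while (i < bitset.length) {
        int run = 1;
        while (i + run < bitset.length
            && bitset[i + run] == bitset[i] && run < 255) {
          run++;
        }
        out.write(run);       // run length, 1..255
        out.write(bitset[i]); // the repeated byte value
        i += run;
      }
      return out.toByteArray();
    }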

Alternatively, multiple bloom filters could be built at the same time
until it becomes obvious which one matches the data characteristics
best. The downside of this is the increased number of hash
calculations. This downside could be worked around by building only a
single large bloom filter and "folding the bit vector onto itself
multiple times, cutting at the desired size boundary". For example, if
we build a bloom filter of 512 bits but in the end we see that 64 bits
would have been enough, we can split the bit vector into 8 equal chunks
and OR them together (OR rather than XOR, so that every set bit is
preserved and no false negatives are introduced). The resulting bit
vector can still function as a bloom filter by applying a modulo 64
operation to the hashes during lookup. Its false positive rate may
still be worse, though, than if we had used hash functions that map
directly onto 64 bits.
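
A minimal sketch of the folding idea (hypothetical code; it assumes the
filter stores its bitset as longs and places each hash at bit position
h mod filterSize):

    // Fold 512 bits (8 longs) down to 64 bits by OR-ing the chunks.
    // OR keeps every set bit, so no false negatives are introduced.
    static long fold(long[] bits512) { // bits512.length == 8
      long folded = 0;
      for (long word : bits512) {
        folded |= word;
      }
      return folded;
    }

    // Lookup against the folded filter: reduce the hash modulo 64.
    static boolean mightContain(long folded, long hash) {
      int bit = (int) Long.remainderUnsigned(hash, 64);
      return (folded & (1L << bit)) != 0;
    }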

Br,

Zoltan

On Thu, Jan 17, 2019 at 2:56 PM 俊杰陈 <cjjnj...@gmail.com> wrote:

> Hi Parquet Developers
>
> In the bloom filter design doc we have discussed and determined the
> bloom filter definition; now I'd like to invite you to discuss how to
> build a bloom filter in Parquet.
>
> In my current implementation, a bloom filter is created first according
> to the specified number of distinct values and false positive
> probability, then it is updated while the column writer writes values.
> This approach requires the user to estimate the column's NDV in a row
> group, but it is usually hard for end users to get this information,
> especially since they don't have the row group size info. As a result,
> the created bloom filter may neither match the expected FPP nor fit
> into the size requirements. Though I could provide extra parameters
> such as a max bloom filter size to avoid wasting space, I think it can
> still be improved.
>
> So I think the following three things need to be discussed first.
>
> 1. What parameters/configurations should we present to the end user?
> In my mind, a better way would be for users to specify the column names
> and the max sizes they are willing to spend on the bloom filters.
> Parquet would then take responsibility for calculating the NDV and
> creating the bloom filter.
>
> 2. How to calculate the NDV at run time?
> I tried allocating a set to store all hash values for a column chunk
> and then updating the bloom filter bitset at once when flushing the row
> group. I am not sure whether this could cause potential memory issues.
>
> 3. When to update the bloom filter?
> When writing values in the column writer, or when flushing the row
> group? If we use the set to store distinct hash values, we can update
> when flushing the row group.
>
> There are probably more things to consider beyond the above three. I
> would really appreciate any opinions or other issues you think should
> be raised.
>
> --
> Thanks & Best Regards
>
