Hi, I like the idea of specifying the maximum acceptable size of the bloom filter bit vector. I think it would be much better than specifying the expected number of distinct values (which, in my opinion, we cannot expect from the API consumer). The desired false positive probability could still be specified; as a last step before writing the bloom filter to the file, the achieved probability could be checked and the filter dropped if it does not fulfill the check.
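For illustration, a minimal sketch of that last-step check in Java (the names here are made up, not from the actual implementation), using the standard estimate fpp = (1 - e^(-k*n/m))^k for m bits, k hash functions and n observed distinct values:

    public final class BloomFilterCheck {

      // Expected false positive probability of a classic bloom filter with
      // m bits and k hash functions after inserting n distinct values:
      //   fpp = (1 - e^(-k*n/m))^k
      static double expectedFpp(long numBits, int numHashFunctions, long ndv) {
        double exponent = -((double) numHashFunctions * ndv) / numBits;
        return Math.pow(1.0 - Math.exp(exponent), numHashFunctions);
      }

      // Last step before writing: keep the filter only if the probability it
      // actually achieves is no worse than the one the user asked for.
      static boolean worthWriting(long numBits, int numHashFunctions,
                                  long observedNdv, double desiredFpp) {
        return expectedFpp(numBits, numHashFunctions, observedNdv) <= desiredFpp;
      }
    }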
A bit of brainstorming (just some ideas that may or may not be useful): One more thing to consider is whether some smart encoding of the bit vector would help save space. I expect the entropy of a nearly empty or nearly full bloom filter to be relatively low, because they consist mostly of zeroes/ones (respectively). For example, RLE encoding could take advantage of such a pattern, but it should only be used when it actually saves space, because under regular circumstances the entropy will be high and RLE will only increase the data size (a toy sketch appears below the quoted message).

Alternatively, multiple bloom filters may be built at the same time until it becomes obvious which one matches the data characteristics best. The downside of this is the increased number of hash calculations. This downside could be worked around by building only a large bloom filter and "folding the bit vector onto itself multiple times, cutting at the desired size boundary". For example, if we build a bloom filter of 512 bits but in the end we see that 64 bits would have been enough, we can split the bit vector into 8 equal chunks and OR them together (OR rather than XOR, so that no set bit is lost and the filter keeps its no-false-negatives guarantee). The resulting bit vector can still function as a bloom filter by applying a modulo 64 operation to the hashes during lookup (see the second sketch below the quoted message). Its efficiency may be worse, though, than if we used hash functions that directly map onto 64 bits.

Br,

Zoltan

On Thu, Jan 17, 2019 at 2:56 PM 俊杰陈 <cjjnj...@gmail.com> wrote:

> Hi Parquet Developers
>
> In the bloom filter design doc we have discussed and determined the bloom
> filter definition; now I'd like to invite you to discuss how to build a
> bloom filter in Parquet.
>
> In my current implementation, a bloom filter is created first according to
> the specified number of distinct values and false positive probability, and
> then it is updated as the column writer writes values. This approach needs
> the user to estimate the column's NDV in a row group, but it is usually
> hard for end users to get this information, especially since they don't
> have the row group size info. As a result, the created bloom filter may
> neither match the expected FPP nor fit into the size requirements. Though I
> could provide extra parameters such as a max bloom filter size to avoid
> wasting space, I think it can still be improved.
>
> So I think the following three things need to be discussed first.
>
> 1. What parameters/configurations should we present to the end user?
> In my mind, a better way is for users to specify the column names and the
> max sizes they are willing to use to build the bloom filter. Parquet takes
> responsibility for calculating the NDV and creating the bloom filter.
>
> 2. How to calculate the NDV at run time?
> I tried allocating a set to store all hash values for a column chunk and
> then updating the bloom filter bitset at once when flushing the row group.
> I'm not sure whether this will cause potential memory issues or not.
>
> 3. When to update the bloom filter?
> When writing values in the column writer, or when flushing the row group?
> If we use the set to store distinct hash values, we can update when
> flushing the row group.
>
> There are surely more things to care about beyond the above three. I would
> really appreciate any opinions or other issues you think need to be raised.
>
> --
> Thanks & Best Regards
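P.S. Here are the two sketches referenced above. First, the RLE fallback: a purely illustrative byte-level encoding, not a proposed Parquet format (a real format would also need a flag recording which representation was written):

    import java.io.ByteArrayOutputStream;

    public final class RleIfSmaller {

      // Encode the serialized bit vector as (runLength, byteValue) pairs.
      // Return the raw bytes unchanged when the encoding would not save
      // space: a well-loaded filter has high entropy and RLE would grow it.
      static byte[] encodeIfSmaller(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length) {
          int run = 1;
          while (i + run < raw.length && raw[i + run] == raw[i] && run < 255) {
            run++;
          }
          out.write(run);     // run length, 1..255
          out.write(raw[i]);  // the repeated byte value
          i += run;
        }
        byte[] encoded = out.toByteArray();
        return encoded.length < raw.length ? encoded : raw;
      }
    }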
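Second, the folding idea, again hypothetical (none of these names come from the actual implementation). The filter is built at 512 bits and can be folded down to any divisor of that size by OR-ing the chunks together; lookups then reduce the same hashes modulo the folded size:

    import java.util.BitSet;

    public final class FoldableBloomFilter {
      private static final int BUILD_BITS = 512; // size used while writing
      private static final int NUM_HASHES = 4;   // k, chosen up front
      private final BitSet bits = new BitSet(BUILD_BITS);

      // Insert using k probe positions derived from one 64-bit hash,
      // spread over the full 512-bit vector.
      void insert(long hash) {
        for (int i = 0; i < NUM_HASHES; i++) {
          bits.set((int) Long.remainderUnsigned(probe(hash, i), BUILD_BITS));
        }
      }

      // Fold the 512-bit vector down to targetBits (must divide 512).
      // OR keeps every set bit, so folding can only raise the false positive
      // rate; it can never introduce false negatives.
      BitSet fold(int targetBits) {
        BitSet folded = new BitSet(targetBits);
        for (int b = bits.nextSetBit(0); b >= 0; b = bits.nextSetBit(b + 1)) {
          folded.set(b % targetBits);
        }
        return folded;
      }

      // Lookup against a folded vector: same probes, reduced modulo its size.
      static boolean mightContain(BitSet folded, int foldedBits, long hash) {
        for (int i = 0; i < NUM_HASHES; i++) {
          if (!folded.get((int) Long.remainderUnsigned(probe(hash, i), foldedBits))) {
            return false;
          }
        }
        return true;
      }

      // Simple double hashing to derive the k probe positions.
      private static long probe(long hash, int i) {
        return hash + (long) i * ((hash >>> 32) | 1);
      }
    }

The divisor requirement is what makes the fold consistent: (x mod 512) mod 64 equals x mod 64 only when 64 divides 512, so a bit set during insertion is always found again after folding.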