Thanks Jan -- interesting discussion. Some points:

1. We used the Rust implementation of parquet (not parquet-java).
2. We were trying to find settings that would permit us to effectively prune row groups without having to know the exact shape of the data up front.
3. The blog is an attempt to help others figure out how to configure parquet writers to get effective bloom filter pruning.
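For context on points 2 and 3: in the Rust parquet crate, the bloom filter NDV and FPP are ordinary writer properties (see the docs.rs link in [2] below). A minimal sketch, assuming a recent version of the `parquet` crate; the builder methods shown are the documented ones, though defaults may differ by version:

```rust
// Sketch: configuring bloom filters via the Rust parquet writer properties.
// The NDV passed here is only a sizing hint for the filter; nothing forces it
// to match the true NDV of the data in each row group.
use parquet::file::properties::WriterProperties;

fn bloom_writer_props() -> WriterProperties {
    WriterProperties::builder()
        // Enable bloom filters for all columns.
        .set_bloom_filter_enabled(true)
        // Sizing hint: number of distinct values the filter should plan for.
        .set_bloom_filter_ndv(10_000)
        // Target false positive probability.
        .set_bloom_filter_fpp(0.01)
        .build()
}
```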
> The blog post claims that just using 8KB instead of 2MB works just fine as well. But that doesn't make sense, theory-wise, unless the FPP formula is wrong (which it isn't; it's proven).

One reason we did an empirical study was to see how to match the theory to practice. All the code used is open source [1], and we would welcome either corrections to the code or help understanding why your theory-based argument doesn't seem to match what we measured in practice.

> Side comment: First, the blog post talks about "varying the NDV parameter". However, NDV isn't a parameter that one can vary freely. It is supposed to be the *true* NDV count in the data set. That is, for Parquet it should be the true NDV count in the row group for which the bloom filter is built. Nothing to vary here.

The blog is referring to the Rust parquet writer settings [2], which do let you vary the NDV parameter used to create the Bloom filters. There is nothing in the spec, to my knowledge, that requires the NDV used to match the actual NDV in the data.

> I guess your experiment actually measures the best case, as then your numbers are in perfect sync with theory: With 8kb (6.4% FPP), over 90% of row groups are filtered, while with 16kb (0.2% FPP) all row groups are correctly pruned. But this result just confirms that the sizing formula that Parquet uses is correct.

This seems like an excellent conclusion to me. I don't think we meant to imply there is anything incorrect with the spec.

Andrew

[1] https://github.com/influxdata/parquet-bloom-filter-analysis
[2] https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_bloom_filter_ndv

> Now, if I understand the claims of the blog post correctly, the blog post basically says that if you underestimate the NDV count heavily, you still get great pruning results. And this just contradicts the FPP formula.
> As just elaborated, by underestimating the NDV count, you basically should get a higher FPP, which should reduce the number of row groups you can prune. But the measurement seems to hint that you can still prune perfectly. So basically, the blog post empirically finds that the FPP formula for bloom filters is wrong (or do I misunderstand the blog post here?).
>
> *Now comes the important point and I think here is where the problem lies:* The blog post doesn't say by how much the NDV is underestimated though. It mentions the true NDV count of the whole data set (called cardinality in the blog post), but it does not mention what that means for the true NDV count *per row group*. The latter is dependent on the sorting of the data. Is it sorted? Or is it random? In the former case, you will have way fewer NDVs per row group than in the latter case (around 100x fewer in the 1M cardinality case).
>
> *Let's assume the worst case* (close to random ordering of rows): For the 1M cardinality example, we would really also get a true 1M NDV count in each row group, so basically only unique values. In this case, per the FPP formula, if you only use an 8kb bloom filter, you should get the following FPP:
>
>     FPP = (1 - e^(-k*n/m))^k
>
> With k = 8, n = 1'000'000, m = 64'000 (8kb in bits), we get basically 1.000, so 100% false positive probability. And that also matches intuition: Each NDV sets 8 bits, and since you only have 64'000 bits to set, but you're setting 8 million bits, each bit ends up being 1 with almost certain probability, so you end up with a useless bloom filter of only 1 bits.
>
> *Let's assume the best case:* If the values are sorted and we have a 1 million NDV count for 100 million values, that means we have a 10'000 NDV count in each row group (each value duplicated 100 times). For this case, the expected FPP with an 8kb bloom filter is 6.4% (k = 8, n = 10'000, m = 64'000).
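The two cases above can be checked numerically. A minimal sketch (function names are mine) using the classic bloom filter FPP formula and the sizing formula from the Parquet bloom filter spec; note Parquet actually uses split-block bloom filters, for which the classic formula is only an approximation, so the best-case result lands near, but not exactly at, the 6.4% quoted above:

```rust
/// Classic bloom filter false positive probability:
///   FPP ≈ (1 - e^(-k*n/m))^k
/// k = hash functions, n = distinct values inserted, m = filter size in bits.
fn bloom_fpp(k: f64, n: f64, m: f64) -> f64 {
    (1.0 - (-k * n / m).exp()).powf(k)
}

/// Sizing formula from the Parquet spec (which fixes k = 8):
///   bits = -8 * ndv / ln(1 - fpp^(1/8))
fn parquet_bloom_bits(ndv: f64, fpp: f64) -> f64 {
    -8.0 * ndv / (1.0 - fpp.powf(1.0 / 8.0)).ln()
}

fn main() {
    // Worst case: 1M distinct values in a 64'000-bit (8kb) filter -> ~1.000.
    println!("worst case FPP: {:.3}", bloom_fpp(8.0, 1_000_000.0, 64_000.0));
    // Best case: 10'000 distinct values per row group -> roughly 6-7% FPP.
    println!("best case FPP:  {:.3}", bloom_fpp(8.0, 10_000.0, 64_000.0));
    // Spec sizing for ndv = 10'000 at 1% FPP -> roughly 12kb, as in the thread.
    println!("spec size: {:.1} kb", parquet_bloom_bits(10_000.0, 0.01) / 8_000.0);
}
```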
> For this case, the bloom filter is definitely usable. However, for this case, the default Parquet sizing formula would also not have yielded 2MB, but rather 12kb, so if this is the case that your benchmark measures, then your values are pretty much in agreement with the Parquet sizing formula and there is just no discrepancy.
>
> I guess your experiment actually measures the best case, as then your numbers are in perfect sync with theory: With 8kb (6.4% FPP), over 90% of row groups are filtered, while with 16kb (0.2% FPP) all row groups are correctly pruned. But this result just confirms that the sizing formula that Parquet uses is correct.
>
> So could it be that the blog post accidentally measures an NDV count of 10'000 while the intuition is given that we're dealing with an NDV count of 1'000'000 in the 1'000'000 cardinality case?
>
> Cheers
> Jan
>
> On Sat, Jun 1, 2024 at 09:37, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > I think the table is useful; there are calculators online that do this pretty easily, but putting it into the docs might allow at least some users to avoid unpleasant surprises. In terms of generalizing to smaller NDV counts and their effectiveness, we might just want to state the result but provide strong caveats to benchmark with their own data?
> >
> > Cheers,
> > Micah
> >
> > On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > > When we first looked into Parquet bloom filters [1], it was hard to understand how effective they would be for a given amount of space overhead.
> > >
> > > When we plugged our data's cardinality into the target ndv and fpp parameters, it implied 2MB bloom filters *per column* per row group, which was unacceptable.
> > >
> > > However, when we did empirical tests (see the blog [2] from Trevor Hilton), we found 2K-8K worked quite well.
> > > Is there any interest in porting some of the information from the blog into the spec (specifically the tables of size based on fpp/ndv)? Or is this better as a third-party resource / exercise for the reader?
> > >
> > > Andrew
> > >
> > > [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> > > [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/