Fully agree, let's get this into the spec. I am a big fan of specifications that also contain side information instead of being just dry standardese that requires implementers to figure out the ramifications themselves.
If you suggest a PR, I'm happy to give you a review.

> Another way might be to cap the bloom filter size up front (e.g. "I am willing to spend no more than 8KB overhead per column per row group") and accept the fact that the data may not permit a useful bloom filter with that budget.

Indeed, this also works and is probably a good idea for writers that for some reason cannot do two passes over the data. A few comments regarding this:

- When writing such a size-capped bloom filter, writers should check the bloom filter quality after the filter is built (no implementation does that so far AFAIK, but the spec should describe what an implementation should aspire to do). The nice thing about bloom filters is that you can check their quality by counting the 1 bits (quite fast with the popcount instruction) or by maintaining a sketch as already described. A rough sketch of such a check follows after this list.
- If the bloom filter is too big, it will have very few 1s. This is not a problem; the bloom filter can still be quite useful in this case, as it will have almost no false positives. It just wastes some space, which is okay.
- If the bloom filter is too small, it will have a lot of 1s. Such a bloom filter can have a false positive rate that is basically 100%. As using a bloom filter comes with opportunity cost (loading the bloom filter from storage, probing the bloom filter incurring a cache miss), such a bloom filter will actually make readers slower. Thus, such a bloom filter should be discarded and not written to the Parquet file. This is important, because otherwise you will see performance regressions when enabling bloom filters.
- I would have to dig into some papers again to get a formula for the 1-bit percentage at which a bloom filter can be considered trash. I'll update once I have it.
- Sadly, mostly unique columns--whose bloom filters require a lot of memory, as discussed--will likely always run into the space limit and therefore never get a useful bloom filter. This is a shame, as some important use cases would actually benefit from one. The most prominent case is using UUIDs or similar random ids that do not show clustering as a (pseudo) key. Since the column is mostly unique, a successful bloom filter could always filter out all but one row group, but we need a big one to do this. That's just a theoretical restriction of bloom filters and there is nothing we can do about it. It should be mentioned, though, so that users understand the ramifications of size-capping a bloom filter: they will never get good results for (mostly) unique columns unless they are willing to spend a considerable amount of memory.
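To make the first point concrete, here is a rough sketch of such a post-build quality check. This is purely illustrative (not code from any existing implementation): it assumes a classic bloom filter stored as 64-bit words with k hash functions, and the 5% cutoff is an arbitrary placeholder. For Parquet's split-block bloom filters the exact relationship between fill ratio and FPP differs slightly, but the idea is the same.

    /// Decide whether a freshly built bloom filter is worth keeping.
    /// `words` is the filter's bit array, `k` the number of hash functions.
    fn keep_bloom_filter(words: &[u64], k: u32) -> bool {
        let total_bits = (words.len() * 64) as f64;
        // Count the 1 bits; count_ones() compiles to a popcount instruction
        // on most targets, so this is cheap even for large filters.
        let ones: u64 = words.iter().map(|w| w.count_ones() as u64).sum();
        let fill_ratio = ones as f64 / total_bits;

        // For a classic bloom filter, the false positive probability is
        // roughly fill_ratio^k: a half-full filter with k = 8 is still fine
        // (about 0.4% FPP), while one that is ~95% full is useless (~66% FPP).
        let approx_fpp = fill_ratio.powi(k as i32);

        // Keep the filter only if its observed FPP is still useful.
        approx_fpp < 0.05
    }

A writer with an 8KB budget would build the capped filter, run a check like this, and simply drop the filter instead of writing a useless one.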
Cheers,
Jan

Am Di., 4. Juni 2024 um 12:36 Uhr schrieb Andrew Lamb <andrewlam...@gmail.com>:

> So it seems there are at least three ways forward:
> 1. Leave this email chain in the archives (nothing more)
> 2. Incorporate some of the content of these emails into the spec as a "Background" and "Recommendation" section
> 3. Write a separate blog post / other content
>
> I would personally be inclined for #2 (and if Jan and the maintainers agree I could take a shot at drafting it)
>
> Thank you,
> Andrew
>
> On Mon, Jun 3, 2024 at 9:33 AM Jan Finis <jpfi...@gmail.com> wrote:
>
> > Thanks a lot for the answer, that clarifies it for me.
> >
> > Let me elaborate on this a bit longer. Bloom filters are a topic where a lot of theory is necessary to make them work effectively, and the fact that most implementations today leave the hard part (getting the NDV count right) to the client code makes it hard to use them effectively and easy to make bloom filters that won't work or will waste space.
> >
> > Given that the spec is also meant as help for implementations on getting this right, I agree that we should put quite some more theoretical basics into the spec, so not everyone has to reinvent the wheel and find this out for themselves.
> >
> > > The blog is referring to the Rust parquet writer settings [1] which do let you vary the NDV parameter used to create the Bloom filters. There is nothing in the spec to my knowledge that requires the NDV used to match the actual NDV in the data.
> >
> > Got it! That's where my disconnect is stemming from then. I was arguing from the theory side, while your experience is from the Rust implementation.
> >
> > I would argue that it is implied in the spec that the NDV count is the real NDV count, as the spec only talks about theory, not about implementation parameters. The spec does not talk about how to obtain such an NDV count, though. The fact that this is a parameter in the implementation is probably an artifact of the implementation not tracking the NDV count itself and therefore expecting the user to track it and pass it in as a parameter. I would still argue that the chosen NDV count should match the real one; otherwise your FPP will be totally off. On the other hand, if the NDV count is correct, you are basically guaranteed to get the FPP you're expecting.
> >
> > > One reason we did an empirical study was to see how to match the theory to practice. All the code used is open source [1] and we would welcome either corrections to the code or help understanding why your theory based argument doesn't seem to match what we measured in practice.
> >
> > As we found out together, I guess the code itself is fine. We might have just drawn imprecise conclusions from it. It was argued that Parquet would suggest a ~2MB bloom filter in the 1M cardinality case, but as it turned out, the 1M cardinality case was likely just a 10'000 NDV count per row group case, so the Parquet spec would rather suggest a 12kB bloom filter. As the data in the benchmark showed clustering, the assumption "cardinality ~= row group NDV count" was incorrect and all further discrepancy followed from that.
> >
> > > 2. We were trying to find settings that would permit us to effectively prune row groups without having to know the exact shape of the data up front.
> >
> > Let me put on my database theorist's hat here and be frank: this is impossible in general. NDV counts can be anything from 1 to millions, so trying to find a ballpark estimate that always works is just prone to fail. Sizing a bloom filter without knowing the NDV count will almost certainly lead to a bloom filter that is either way too big (wasting space) or way too small (being useless due to false positives). A badly sized bloom filter has severe disadvantages, as readers will use it, wasting I/O and potentially doing an expensive check that is useless because a too small bloom filter just always returns true. There is so much to lose that you're probably better off not writing a bloom filter at all in this case. A bloom filter is a data structure that just needs knowledge of the (real) NDV count to be effective.
> > And as you see in the table in the Parquet spec, bloom filter quality is *very* dependent on good NDV estimates. E.g., the difference between 10% FPP and 0.1% FPP is only a factor of 2.81 in bloom filter size. So, if you get your NDV guesstimate wrong by just 3x, you might get a bloom filter with 10% FPP, even though you expected one with 0.1% FPP. Or with the numbers from the benchmark: the bloom filter for the 1M case was good with NDV count 5000 (91% pruned) but was mostly useless with 1000 (5% pruned).
> >
> > So if you want to use bloom filters and expect them to be effective, there is no way around having a good NDV estimate. I would argue that all Parquet writer implementations should track this through sketches, but I see that the open source implementations do not do this yet. This should be somewhat straightforward to implement (there are open source sketch implementations available [1][2]).
> >
> > As long as your Parquet writer implementation of choice doesn't do this, I would advise doing it yourself in client code. Pick a number of rows that is roughly your row group size and compute an NDV estimate through a sketch (this should be orders of magnitude cheaper than writing the Parquet file, so it shouldn't matter too much performance-wise). Then input this estimate as the NDV parameter into the implementation.
> >
> > The only disadvantage to mention is that when using sketches to estimate bloom filter size, you need two passes over the data per row group: one to estimate the bloom filter size, and then one to fill the bloom filter. However, given that you otherwise risk the bloom filter being ineffective, this has to be worth it.
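(As an aside on the client-code approach described in the quoted paragraphs above: below is a minimal, purely hypothetical sketch of what the first pass could look like in Rust, using the hyperloglogplus crate [1] and the Rust parquet writer's set_bloom_filter_ndv setting mentioned later in this thread. The function name and overall structure are invented for illustration, and the exact WriterPropertiesBuilder methods should be double-checked against the parquet crate version in use.)

    use std::collections::hash_map::RandomState;

    use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};
    use parquet::file::properties::WriterProperties;

    // Pass 1: estimate the NDV of a chunk of values that is roughly one row
    // group in size, then size the bloom filter from that estimate.
    fn writer_props_from_sample(values: &[i64]) -> WriterProperties {
        let mut hll: HyperLogLogPlus<i64, _> =
            HyperLogLogPlus::new(14, RandomState::new()).unwrap();
        for v in values {
            hll.insert(v);
        }
        let ndv = hll.count().ceil() as u64;

        // Pass 2 (actually writing the row group) happens elsewhere; here we
        // only configure the writer so its bloom filter is sized for the
        // estimated NDV instead of a guess.
        WriterProperties::builder()
            .set_bloom_filter_enabled(true)
            .set_bloom_filter_ndv(ndv)
            .build()
    }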
> > With this common understanding, I would like to draw a conclusion. Things along the lines of this should probably also be mentioned in the spec:
> >
> > - Bloom filters are amazing if you got the NDV count right; otherwise they quickly deteriorate into space waste or a very high false positive probability.
> > - In the case of Parquet bloom filters, what matters is the NDV count in the respective row group, not the NDV count / cardinality of your data set. The two can be equal if you have no clustering or ordering, but a lot of data in the real world is clustered, so don't expect that you can infer one from the other.
> > - Sketches can help you compute an NDV count that is good enough basically for free (compared to the cost of creating the bloom filter), but the writer will need to perform two passes over each row group.
> >
> > And finally, one thing that the blog didn't measure: bloom filters can get *very big* if the NDV count of your row group is close to the number of values (i.e., the column is mostly unique). Since the blog only went up to an NDV count of ~10'000, it didn't measure the case where we have a (true) NDV count of 1'000'000 for a 1'000'000 row group.
> >
> > As the table in the spec mentions, we need 16.2 bits (roughly 2 bytes) per unique value if we want 0.1% FPP. So let's say we have a unique 4 byte integer column. Then the values are 4 bytes each, while the bloom filter is 2 bytes per value, so the bloom filter overhead is 33%. But let's say we have a pattern in the data that allows us to encode the values more efficiently.
> >
> > For example, say that the values are mostly incrementing with some gaps, so DELTA_BINARY_PACKED encoding can be used to mostly store 4 bit deltas. Suddenly, each encoded value might be ~0.5 bytes, but the bloom filter cannot use this compression, so it is still 2 bytes per value. Boom, we have just created a bloom filter that is 4 times larger than the data itself, ouch!
> >
> > So size-wise, the conclusion should be:
> >
> > - Bloom filters are small if the (true) NDV count (per row group) << row count (per row group). Once the NDV count approaches the row count, a bloom filter's size can become as big as the data itself or even many times larger.
> >
> > Sorry for elaborating so long here. This topic is quite dear to me, as I also needed to learn all of this the hard way. Our company also wanted to use bloom filters and we were discussing for so long why they didn't work as expected, so I needed to work myself into all that theory. And I feel that this should just be in the spec, so everyone can read up on it without having to derive it themselves.
> >
> > Cheers,
> > Jan
> >
> > [1] https://docs.rs/hyperloglogplus/latest/hyperloglogplus/
> > [2] https://datasketches.apache.org/docs/Community/Downloads.html
> >
> > Am Mo., 3. Juni 2024 um 13:45 Uhr schrieb Andrew Lamb <andrewlam...@gmail.com>:
> >
> > > Thanks Jan -- interesting discussion
> > >
> > > Some points:
> > > 1. We used the Rust implementation of parquet (not parquet-java).
> > > 2. We were trying to find settings that would permit us to effectively prune row groups without having to know the exact shape of the data up front.
> > > 3. The blog is an attempt to help others figure out how to configure parquet writers to get effective bloom filter pruning.
> > >
> > > > The blog post claims that just using 8KB instead of 2MB works just fine as well. But that doesn't make sense, theory wise, unless the FPP formula is wrong (which it isn't; it's proven).
> > >
> > > One reason we did an empirical study was to see how to match the theory to practice. All the code used is open source [1] and we would welcome either corrections to the code or help understanding why your theory based argument doesn't seem to match what we measured in practice.
> > >
> > > > Side comment: First, the blog post talks about "varying the NDV parameter". However, NDV isn't a parameter that one can vary freely. It is supposed to be the *true* NDV count in the data set. That is, for Parquet it should be the true NDV count in the row group for which the bloom filter is built. Nothing to vary here.
> > >
> > > The blog is referring to the Rust parquet writer settings [1] which do let you vary the NDV parameter used to create the Bloom filters. There is nothing in the spec to my knowledge that requires the NDV used to match the actual NDV in the data.
> > >
> > > > I guess your experiment actually measures the best case, as then your numbers are in perfect sync with theory: With 8kb (6.4% FPP), over 90% of row groups are filtered, while with 16kb (0.2% FPP) all row groups are correctly pruned. But this result just confirms that the sizing formula that Parquet uses is correct.
> > > This seems like an excellent conclusion to me. I don't think we meant to imply there is anything incorrect with the spec.
> > >
> > > Andrew
> > >
> > > [1] https://github.com/influxdata/parquet-bloom-filter-analysis
> > > [2] https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_bloom_filter_ndv
> > >
> > > > Now, if I understand the claims of the blog post correctly, the blog post basically says that if you underestimate the NDV count heavily, you still get great pruning results. And this just contradicts the FPP formula. As just elaborated, by underestimating the NDV count, you basically should get a higher FPP, which should reduce the number of row groups you can prune. But the measurement seems to hint that you can still prune perfectly. So basically, the blog post empirically finds that the FPP formula for bloom filters is wrong (or do I misunderstand the blog post here?).
> > > >
> > > > *Now comes the important point and I think here is where the problem lies:* The blog post doesn't say by how much the NDV is underestimated, though. It mentions the true NDV count of the whole data set (called cardinality in the blog post), but it does not mention what that means for the true NDV count *per row group*. The latter is dependent on the sorting of the data. Is it sorted? Or is it random? In the former case, you will have way fewer NDVs per row group than in the latter case (around 100x fewer in the 1M cardinality case).
> > > >
> > > > *Let's assume the worst case* (close to random ordering of rows): For the 1M cardinality example, we would really also get a true 1M NDV count in each row group, so basically only unique values. In this case, per the FPP formula, if you only use an 8kb bloom filter, you should get the following FPP: FPP = (1 - e^(-k*n/m))^k. With k = 8, n = 1'000'000, m = 64'000 (8kb in bits), we get basically 1.000, so 100% false positive probability. And that also matches intuition: each NDV sets 8 bits, and since you only have 64'000 bits to set, but you're setting 8 million bits, each bit ends up being 1 with almost certain probability, so you end up with a useless bloom filter of only 1 bits.
> > > >
> > > > *Let's assume the best case:* If the values are sorted and we have a 1 million NDV count for 100 million values, that means we have a 10'000 NDV count in each row group (each value duplicated 100 times). For this case, the expected FPP with an 8kb bloom filter is 6.4% (k = 8, n = 10'000, m = 64'000).
> > > >
> > > > For this case, the bloom filter is definitely usable. However, for this case, the default Parquet sizing formula would have also not yielded 2MB, but rather 12kb, so if this is the case that your benchmark measures, then your values are pretty much in agreement with the Parquet sizing formula and there is just no discrepancy.
> > > > I guess your experiment actually measures the best case, as then your numbers are in perfect sync with theory: With 8kb (6.4% FPP), over 90% of row groups are filtered, while with 16kb (0.2% FPP) all row groups are correctly pruned. But this result just confirms that the sizing formula that Parquet uses is correct.
> > > >
> > > > So could it be that the blog post accidentally measures an NDV count of 10'000 while the intuition is given that we're dealing with an NDV count of 1'000'000 in the 1'000'000 cardinality case?
> > > >
> > > > Cheers
> > > > Jan
> > > >
> > > > Am Sa., 1. Juni 2024 um 09:37 Uhr schrieb Micah Kornfield <emkornfi...@gmail.com>:
> > > >
> > > > > I think the table is useful; I think there are calculators online that do this pretty easily, but putting it into the docs might allow at least some users to avoid unpleasant surprises. In terms of generalizing to smaller NDV counts and their effectiveness, we might just want to state the result but provide strong caveats to benchmark with their own data?
> > > > >
> > > > > Cheers,
> > > > > Micah
> > > > >
> > > > > On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> > > > >
> > > > > > When we first looked into Parquet bloom filters[1] it was hard to understand how effective they would be for a given amount of space overhead.
> > > > > >
> > > > > > When we plugged our data's cardinality into the target ndv and fpp parameters, it implied 2MB bloom filters *per column* per row group, which was unacceptable.
> > > > > >
> > > > > > However, when we did empirical tests (see Blog[2] from Trevor Hilton), we found 2K-8K worked quite well.
> > > > > >
> > > > > > Is there any interest in porting some of the information from the blog into the spec (specifically the tables of size based on fpp/ndv)? Or is this better as a third-party resource / exercise for the reader?
> > > > > >
> > > > > > Andrew
> > > > > >
> > > > > > [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> > > > > > [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/