Thanks a lot for the answer, that clarifies it for me.

Let me elaborate on this a bit. Bloom filters are a topic where a fair amount
of theory is needed to make them work effectively, and the fact that most
implementations today hand the hard part (getting the NDV count right) to
client code makes them difficult to use effectively and easy to misconfigure,
producing bloom filters that either don't work or waste space.

Given that the spec is also meant to help implementations get this right, I
agree that we should put considerably more of the theoretical basics into the
spec, so that not everyone has to reinvent the wheel and figure this out for
themselves.

> The blog is referring to the Rust parquet writer settings [1] which do let
> you vary the NDV parameter used to create the Bloom filters. There is
> nothing in the spec to my knowledge that requires the NDV used to match the
> actual NDV in the data.


Got it! That's where my disconnect was coming from, then. I was arguing
from the theory side, while your experience is with the Rust implementation.

I would argue that the spec implies that the NDV count is the real NDV count,
as the spec only talks about theory, not about implementation parameters. The
spec does not say how to obtain such an NDV count, though. The fact that this
is a parameter in the implementation is probably an artifact of the
implementation not tracking the NDV count itself and therefore expecting the
user to track it and pass it in as a parameter. I would still argue that the
chosen NDV count should match the real one; otherwise your FPP will be
totally off. On the other hand, if the NDV count is correct, you are
basically guaranteed to get the FPP you're expecting.
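
For concreteness, this is roughly what that parameter looks like in the Rust
writer (a minimal sketch against the WriterProperties builder that the
settings link below points to; the concrete NDV and FPP values are just
made-up illustrations, not a recommendation):

    use parquet::file::properties::WriterProperties;

    fn main() {
        // As far as I can tell, nothing here checks that the NDV matches the
        // data; the filter is sized purely from these two numbers.
        let _props = WriterProperties::builder()
            .set_bloom_filter_enabled(true)
            .set_bloom_filter_ndv(10_000) // meant to be the *true* per-row-group NDV
            .set_bloom_filter_fpp(0.001)  // target FPP; only holds if the NDV is right
            .build();
    }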

> One reason we did an empirical study was to see how to match the theory to
> practice. All the code used is open source [1] and we would welcome either
> corrections to the code or help understanding why your theory based
> argument doesn't seem to match what we measured in practice.


As we found out together, I guess the code itself is fine. We might have
just drawn imprecise conclusions from it. It was argued that Parquet would
suggest a ~2MB bloom filter in the 1M cardinality case, but as it turned out,
the 1M cardinality case was likely just a 10'000 NDV count per row group
case, for which the Parquet spec would rather suggest a 12kB bloom filter.
Since the data in the benchmark was clustered, the assumption
"cardinality ~= row group NDV count" was incorrect, and all further
discrepancy followed from that.


> 2. We were trying to find settings that would permit us to effectively
> prune row groups without having to know the exact shape of the data up
> front.
>

Let me put on my database theorist's hat here and be frank: this is
impossible in general. NDV counts can be anything from 1 to millions, so
trying to find a ballpark estimate that always works is bound to fail.
Sizing a bloom filter without knowing the NDV count will almost certainly
lead to a bloom filter that is either way too big (wasting space) or way
too small (useless due to false positives). A badly sized bloom filter has
severe disadvantages: readers will still use it, wasting I/O and potentially
doing an expensive check that is pointless because an undersized filter
essentially always returns true. There is so much to lose that you're
probably better off not writing a bloom filter at all in this case. A bloom
filter is a data structure that simply needs knowledge of the (real) NDV
count to be effective.

And as you can see in the table in the Parquet spec, bloom filter quality is
*very* dependent on good NDV estimates. E.g., the difference in size between
10% FPP and 0.1% FPP is only a factor of 2.81. So, if your NDV guesstimate is
off by just 3x, you might get a bloom filter with 10% FPP even though you
expected one with 0.1% FPP. Or with the numbers from the benchmark: the bloom
filter for the 1M case was good with an NDV count of 5000 (91% pruned) but
was mostly useless with 1000 (5% pruned).
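
To make that sensitivity concrete, here is a tiny sketch (plain Rust, using
the classic approximation FPP = (1 - e^(-k*n/m))^k with k = 8; Parquet's
split block bloom filters deviate slightly from this formula, so treat the
numbers as ballpark figures, not as the spec table's exact values):

    fn fpp(k: f64, n: f64, m_bits: f64) -> f64 {
        // Classic bloom filter approximation: FPP = (1 - e^(-k*n/m))^k
        (1.0 - (-k * n / m_bits).exp()).powf(k)
    }

    fn main() {
        // Filter sized for 10'000 NDV at ~0.1% FPP (16.2 bits/value per the spec table).
        let m_bits = 16.2 * 10_000.0;
        for n in [10_000.0, 30_000.0, 100_000.0] {
            println!("true NDV {:>7}: FPP ~{:.2}%", n, 100.0 * fpp(8.0, n, m_bits));
        }
        // Roughly: well under 0.1% at the assumed NDV, >10% at 3x the NDV, ~94% at 10x.
    }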

So if you want to use bloom filters and expect them to be effective, there
is no way around having a good NDV estimate. I would argue that all Parquet
writer implementations should track this through sketches, but I see that
the open source implementations do not do this yet. This should be somewhat
straightforward to implement (as there are open source sketch
implementations available [1][2]).

As long as your Parquet writer implementation of choice doesn't do this, I
would advise doing it yourself in client code. Pick a number of rows that is
roughly your row group size and compute an NDV estimate through a sketch
(this should be orders of magnitude cheaper than writing the Parquet file,
so it shouldn't matter too much performance-wise). Then pass this estimate
as the NDV parameter into the implementation.
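
As a rough sketch of what I mean, using the HyperLogLog++ crate from [1]
(the buffered row group, the sketch precision of 14, and the value type are
all just illustrative assumptions):

    use std::collections::hash_map::RandomState;
    use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};

    fn main() {
        // Pretend this is one row group's worth of column values, buffered in client code.
        let row_group: Vec<u64> = (0..1_000_000).map(|i| i % 10_000).collect();

        // First pass: estimate the NDV with a HyperLogLog++ sketch.
        let mut sketch: HyperLogLogPlus<u64, _> =
            HyperLogLogPlus::new(14, RandomState::new()).unwrap();
        for v in &row_group {
            sketch.insert(v);
        }
        let ndv_estimate = sketch.count().round() as u64;
        println!("estimated NDV: {}", ndv_estimate); // ~10'000 here

        // Second pass: hand the estimate to the writer, e.g. via
        // WriterProperties::builder().set_bloom_filter_ndv(ndv_estimate) as above,
        // and then actually write the row group so it can size and fill the filter.
    }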

The only disadvantage to mention is that when using sketches to estimate
bloom filter size, you need two passes over the data per row group: one to
estimate the bloom filter size and one to fill the bloom filter. However,
given that you otherwise risk the bloom filter being ineffective, this should
be worth it.

With this common understanding, I would like to draw some conclusions.
Something along these lines should probably also be mentioned in the spec:


   - Bloom filters are amazing if you get the NDV count right; otherwise
   they quickly deteriorate into wasted space or a very high false positive
   probability.
   - For Parquet bloom filters, what matters is the NDV count in the
   respective row group, not the NDV count / cardinality of your data set.
   The two can be equal if you have no clustering or ordering, but a lot of
   data in the real world is clustered, so don't expect to be able to infer
   one from the other.
   - Sketches can help you compute an NDV estimate that is good enough
   basically for free (compared to the cost of creating the bloom filter),
   but the writer will need to perform two passes over each row group.


And finally, one thing that the blog didn't measure: bloom filters can get
*very big* if the NDV count of your row group is close to the number of
values (i.e., the column is mostly unique). Since the blog only went up to an
NDV count of ~10'000, it didn't measure the case where we have a (true) NDV
count of 1'000'000 for a 1'000'000-row row group.

As the table in the spec mentions, we need 16.2 bits (roughly 2 bytes) per
unique value if we want 0.1% FPP. So let's say we have a unique 4-byte
integer column. Then the values are 4 bytes each, while the bloom filter is
2 bytes per value, so the bloom filter overhead is 33%. But let's say we have
a pattern in the data that allows us to encode the values more efficiently.

For example, say the values are mostly incrementing with some gaps, so
DELTA_BINARY_PACKED encoding can be used to mostly store 4-bit deltas.
Suddenly, each encoded value might be ~0.5 bytes, but the bloom filter cannot
benefit from this compression, so it is still 2 bytes per value. Boom, we
have just created a bloom filter that is 4 times larger than the data
itself, ouch!
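
Spelled out as a quick back-of-the-envelope calculation (the 16.2 bits/value
is the spec table's number for 0.1% FPP; the ~0.5 bytes/value for the
delta-encoded data is just the assumption from the example above):

    fn main() {
        let ndv = 1_000_000_f64;             // mostly-unique column in a 1M-row row group
        let filter_bytes = ndv * 16.2 / 8.0; // ~0.1% FPP per the spec table -> ~2 MB
        let plain_bytes = ndv * 4.0;         // 4-byte integers, PLAIN encoded
        let delta_bytes = ndv * 0.5;         // ~4-bit deltas via DELTA_BINARY_PACKED

        println!("bloom filter: ~{:.2} MB", filter_bytes / 1e6);
        println!("filter / plain data: {:.2}x", filter_bytes / plain_bytes); // ~0.5x
        println!("filter / delta data: {:.2}x", filter_bytes / delta_bytes); // ~4x
    }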

So size-wise, the conclusion should be:

   - Bloom filters are small if (true) NDV count (per row group) << row
   count (per row group). Once the NDV count approaches the row count, a bloom
   filter's size can become as big as the data itself or even many times
   larger than the data itself.


Sorry for elaborating at such length here. This topic is quite dear to me, as
I also had to learn all of this the hard way. Our company also wanted to use
bloom filters, and we spent a long time discussing why they didn't work as
expected, so I had to work my way through all that theory. And I feel that
this should simply be in the spec, so everyone can read up on it without
having to derive it themselves.

Cheers,
Jan

[1] https://docs.rs/hyperloglogplus/latest/hyperloglogplus/
[2] https://datasketches.apache.org/docs/Community/Downloads.html


On Mon, Jun 3, 2024 at 13:45 Andrew Lamb <andrewlam...@gmail.com> wrote:

> Thanks Jan -- interesting discussion
>
> Some points:
> 1. We used the Rust implementation of parquet (not parquet-java).
> 2. We were trying to find settings that would permit us to effectively
> prune row groups without having to know the exact shape of the data up
> front.
> 3. The blog is an attempt to help others figure out how to configure
> parquet writers to get effective bloom filter pruning.
>
> > The blog post claims that just using 8KB instead of 2MB works just fine
> as well. But that
> doesn't make sense, theory wise, unless the FPP formula is wrong (which it
> isn't; it's proven).
>
> One reason we did an empirical study was to see how to match the theory to
> practice. All the code used is open source [1] and we would welcome either
> corrections to the code or help understanding why your theory based
> argument doesn't seem to match what we measured in practice.
>
> > Side comment: First, the blog post talks about "varying the NDV
> parameter".
> However, NDV isn't a parameter that one can vary freely. It is supposed to
> be the *true* NDV count in the data set. That is, for Parquet it should be
> the true NDV count in the row group for which the bloom filter is built.
> Nothing to vary here.
>
> The blog is referring to the Rust parquet writer  settings [1] which do let
> you vary the NDV parameter used to create the Bloom filters. There is
> nothing in the spec to my knowledge that requires the NDV used to match the
> actual NDV in the data.
>
> > I guess your experiment actually measures the best case, as then your
> numbers are in perfect sync with theory: With 8kb (6.4% FPP), over 90% row
> groups are filtered, while with 16kb (0.2% FPP) all row groups are
> correctly pruned. But this result just confirms that the sizing formula
> that Parquet is correct.
>
> This seems like an excellent conclusion to me. I don't think we meant to
> imply there is anything incorrect with the spec.
>
> Andrew
>
>
> [1] https://github.com/influxdata/parquet-bloom-filter-analysis
>
> >
> > [2]
>
> https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_bloom_filter_ndv
>
>
>
> > Now, if I understand the claims of the blog post correctly, the blog post
> > basically says that if you underestimate the NDV count heavily, you still
> > get great pruning results. And this just contradicts the FPP formula. As
> > just elaborated, by underestimating the NDV count, you basically should
> get
> > a higher FPP, which should reduce the number of row groups you can prune.
> > But the measurement seems to hint that you can still prune perfectly. So
> > basically, the blog post empirically finds that the FFP formula for bloom
> > filters is wrong (or do I misunderstand the blog post here?).
> >
> > *Now comes the important point and I think here is where the problem
> lies:*
> > The blog post doesn't say by how much the NDV is underestimated though.
> It
> > mentions the true NDV count of the whole data set (called cardinality in
> > the blog post), but it does not mention what that means for the true NDV
> > count *per row group*. The latter is dependent on the sorting of the
> data.
> > Is it sorted? Or is it random? In the former case, you will have way
> fewer
> > NDVs per row group than in the latter case (around 100x fewer in the 1M
> > cardinality case).
> >
> > *Let's assume the worst case *(close to random ordering of rows): For the
> > 1M cardinality example, we would really also get a true 1M NDV count in
> > each row group, so basically only unique values. In this case, per the
> FPP
> > formula, if you only use a 8kb bloom filter, you should get the following
> > FPP:
> > FPP = (1 - e^(-k*n/m))^k. With k = 8, n = 1'000'000, m = 64'000 (8kb in bits),
> > we get basically 1.000, so 100% false positive probability. And that also
> > matches intuition: Each NDV sets 8 bits, and since you only have 64'000
> > bits to set, but you're setting 8 million bits, each bit ends up being 1
> > with almost certain probability, so you end up with a useless bloom
> filter
> > of only 1 bits.
> >
> > *Let's assume the best case:* If the values are sorted and we have 1
> > million NDV count for 100 million values, that means we have 10'000 NDV
> > count in each row group (Each value duplicated 100 times). For this case,
> > the expected FPP with an 8kb bloom filter is 6.4% (k = 8, n = 10'000, m =
> > 64'000) .
> >
> > For this case, the bloom filter is definitely usable. However, for this
> > case, the default Parquet sizing formula would have also not yielded 2MB,
> > but rather 12kb, so if this is the case that your benchmark measures then
> > your values are pretty much in agreement with the Parquet sizing formula
> > and there is just no discrepancy.
> >
> > I guess your experiment actually measures the best case, as then your
> > numbers are in perfect sync with theory: With 8kb (6.4% FPP), over 90%
> row
> > groups are filtered, while with 16kb (0.2% FPP) all row groups are
> > correctly pruned. But this result just confirms that the sizing formula
> > that Parquet is correct.
> >
> > So could it be that the blog post accidentally measures an NDV count of
> > 10'000 while the intuition is given that we're dealing with an NDV count
> of
> > 1'000'000 in the 1'000'000 cardinality case?
> >
> > Cheers
> > Jan
> >
> >
> > On Sat, Jun 1, 2024 at 09:37 Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > > I think the table is useful, I think there are calculators online that
> do
> > > this pretty easily but putting it into the docs might allow at least
> some
> > > users to avoid unpleasant surprises.  In terms of generalizing to
> smaller
> > > NDV counts and there effectiveness we might just want to state the
> result
> > > but provide strong caveats to benchmark with their own data?
> > >
> > > Cheers,
> > > Micah
> > >
> > >
> > > On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <andrewlam...@gmail.com>
> > > wrote:
> > >
> > > > When we first looked into Parquet bloom filters[1] it was hard to
> > > > understand how effective they would be for a given amount of space
> > > > overhead.
> > > >
> > > > When we plugged our data's cardinality into the target ndv and fpp
> > > > parameters, it implied 2MB bloom filters *per column* per row group
> > which
> > > > was unacceptable.
> > > >
> > > > However, when we did empirical tests (see Blog[2] from Trevor
> Hilton),
> > we
> > > > found 2K-8K worked quite well.
> > > >
> > > > Is there any interest in porting some of the information from the
> blog
> > > into
> > > > the spec (specifically the tables of size based of fpp/ndv)? Or is
> this
> > > > better as a third-party resource / exercise for the reader?
> > > >
> > > > Andrew
> > > >
> > > > [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> > > > [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
> > > >
> > >
> >
>
