Thanks for the suggestion, Andrew!

I'm sceptical about the conclusion in the blog post TBH: the measurements
seem to defy bloom filter theory, or maybe I don't understand the blog post
correctly. I think it might be a problem with the benchmark setup (details
below).

The blog post rightfully claims that bloom filters can get pretty big (2MB
for 1M NDV) if your NDV count (per row group!) is high and you still want a
reasonable false positive probability (FPP; say, 1%), but that is just the
way bloom filters are. The FPP formula is theoretically proven.

But now the blog post argues that you can still get pretty good results
even with a bloom filter that is orders of magnitude smaller than that.
That is in stark contradiction with the FPP formula. The blog post claims
that just using 8 KB instead of 2 MB works just fine as well. But that
doesn't make sense theory-wise, unless the FPP formula is wrong (which it
isn't; it's proven). So we have quite a conundrum here; or maybe I
misunderstand the claims of the blog post.

Let me tease apart how I understand the blog post and where I disagree.
Maybe we can find where the misunderstanding lies:

Side comment: First, the blog post talks about "varying the NDV parameter".
However, NDV isn't a parameter that one can vary freely. It is supposed to
be the *true* NDV count in the data set. That is, for Parquet it should be
the true NDV count in the row group for which the bloom filter is built.
Nothing to vary here.


The only parameter that *can* be tuned for a given data set is the FPP, as
NDV is an inherent property of the data set, not a tuning knob. By
"varying" the NDV as the blog post does, it in reality just varies the FPP:
if you feed a wrong NDV count into the sizing formula, you should no longer
expect to get the desired FPP. If you want a smaller bloom filter, you
should raise the FPP instead. But that's just theory; in the end, you pick
a size for your bloom filter and it doesn't matter which parameter you
changed in your sizing formula to arrive at that value. It just irked me
that a non-variable parameter is varied.
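
To make that concrete, here is a minimal Python sketch (purely
illustrative; it uses the textbook bloom filter formulas, not Parquet's
exact split-block sizing) showing that feeding an underestimated NDV into
the sizing formula just means you silently accept a much higher FPP:

import math

def optimal_bits(ndv: int, fpp: float) -> float:
    # Textbook sizing formula: bits needed to hold `ndv` values at target `fpp`.
    return -ndv * math.log(fpp) / (math.log(2) ** 2)

def effective_fpp(k: int, n: int, m: float) -> float:
    # Textbook FPP formula: (1 - e^(-k*n/m))^k.
    return (1.0 - math.exp(-k * n / m)) ** k

true_ndv = 1_000_000    # actual NDV in the row group
claimed_ndv = 10_000    # underestimated NDV fed into the sizing formula
target_fpp = 0.01

m = optimal_bits(claimed_ndv, target_fpp)   # filter sized for the wrong NDV
k = 8                                       # 8 bits set per value, as in the examples below

print(f"filter size: {m / 8 / 1024:.1f} KiB")                  # ~11.7 KiB instead of megabytes
print(f"FPP you get: {effective_fpp(k, true_ndv, m):.4f}")     # ~1.0000, not the 0.01 you asked for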


Now, if I understand the claims of the blog post correctly, it basically
says that even if you underestimate the NDV count heavily, you still get
great pruning results. And that just contradicts the FPP formula: as
elaborated above, by underestimating the NDV count you should get a higher
FPP, which should reduce the number of row groups you can prune. But the
measurement seems to hint that you can still prune perfectly. So basically,
the blog post empirically finds that the FPP formula for bloom filters is
wrong (or do I misunderstand the blog post here?).

*Now comes the important point, and I think here is where the problem lies:*
The blog post doesn't say by how much the NDV is underestimated, though. It
mentions the true NDV count of the whole data set (called cardinality in
the blog post), but it does not mention what that means for the true NDV
count *per row group*. The latter depends on the ordering of the data. Is
it sorted? Or is it random? In the former case, you will have way fewer
NDVs per row group than in the latter case (around 100x fewer in the 1M
cardinality case).
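
As a quick illustration of how much the ordering matters, here is a
scaled-down toy simulation in Python (the numbers are made up, but keep
the same ratios as the blog's 1M-cardinality case: 100 duplicates per
value and 100 row groups):

import random

NUM_ROWS = 1_000_000      # scaled down from the blog's 100M rows
CARDINALITY = 10_000      # scaled down from 1M distinct values
ROW_GROUP_SIZE = 10_000   # 100 row groups

def ndv_per_row_group(vals):
    # Count the distinct values that land in each row group.
    return [len(set(vals[i:i + ROW_GROUP_SIZE]))
            for i in range(0, len(vals), ROW_GROUP_SIZE)]

# Sorted data: each value is repeated 100 times, back to back.
values = [i // (NUM_ROWS // CARDINALITY) for i in range(NUM_ROWS)]
print("sorted,   max NDV per row group:", max(ndv_per_row_group(values)))   # 100

random.shuffle(values)
print("shuffled, max NDV per row group:", max(ndv_per_row_group(values)))   # several thousand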

*Let's assume the worst case* (close to random ordering of rows): For the
1M cardinality example, we would also get a true NDV count of roughly 1M in
each row group, so basically only unique values. In this case, per the FPP
formula

  FPP ≈ (1 - e^(-k*n/m))^k

an 8 KB bloom filter gives, with k = 8, n = 1'000'000, m = 64'000 (8 KB
expressed in bits), basically 1.000, i.e. a 100% false positive
probability. And that also matches intuition: each distinct value sets 8
bits, and since you only have 64'000 bits to set but you're setting 8
million bits, each bit ends up being 1 with near certainty, so you end up
with a useless bloom filter of almost only 1-bits.

*Let's assume the best case:* If the values are sorted and we have a 1
million NDV count for 100 million values, that means we have a 10'000 NDV
count in each row group (each value is duplicated 100 times, and a row
group holds about 1 million rows). For this case, the expected FPP with an
8 KB bloom filter is about 6.4% (k = 8, n = 10'000, m = 64'000).
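
Plugging these numbers into the FPP formula from above (a quick sanity
check in Python; again the textbook formula, so Parquet's split-block
filters may deviate a bit, but not by orders of magnitude):

import math

def fpp(k: int, n: int, m: int) -> float:
    # Textbook bloom filter false positive probability: (1 - e^(-k*n/m))^k.
    return (1.0 - math.exp(-k * n / m)) ** k

# Worst case: (almost) all 1'000'000 distinct values end up in every row group.
print(f"worst case, 8 KB:  {fpp(k=8, n=1_000_000, m=64_000):.3f}")    # ~1.000, useless

# Best case: sorted data, only 10'000 distinct values per row group.
print(f"best case,  8 KB:  {fpp(k=8, n=10_000, m=64_000):.3f}")       # ~0.067, in the ~6% ballpark

# Same best case with a 16 KB filter (the next size in the benchmark).
print(f"best case,  16 KB: {fpp(k=8, n=10_000, m=128_000):.4f}")      # ~0.002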

In the best case, the bloom filter is definitely usable. However, for this
case the default Parquet sizing formula would not have yielded 2 MB either,
but rather about 12 KB, so if this is the case that your benchmark
measures, then your values are pretty much in agreement with the Parquet
sizing formula and there is just no discrepancy.

I guess your experiment actually measures the best case, as then your
numbers are in perfect sync with theory: with 8 KB (~6.4% FPP), over 90% of
row groups are filtered, while with 16 KB (~0.2% FPP) all row groups are
correctly pruned. But this result just confirms that the sizing formula
that Parquet uses is correct.

So could it be that the blog post accidentally measures an NDV count of
10'000 per row group, while giving the impression that we're dealing with
an NDV count of 1'000'000 in the 1'000'000 cardinality case?

Cheers
Jan


On Sat, Jun 1, 2024 at 9:37 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I think the table is useful, I think there are calculators online that do
> this pretty easily but putting it into the docs might allow at least some
> users to avoid unpleasant surprises.  In terms of generalizing to smaller
> NDV counts and their effectiveness we might just want to state the result
> but provide strong caveats to benchmark with their own data?
>
> Cheers,
> Micah
>
>
> On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <andrewlam...@gmail.com>
> wrote:
>
> > When we first looked into Parquet bloom filters[1] it was hard to
> > understand how effective they would be for a given amount of space
> > overhead.
> >
> > When we plugged our data's cardinality into the target ndv and fpp
> > parameters, it implied 2MB bloom filters *per column* per row group which
> > was unacceptable.
> >
> > However, when we did empirical tests (see Blog[2] from Trevor Hilton), we
> > found 2K-8K worked quite well.
> >
> > Is there any interest in porting some of the information from the blog
> into
> the spec (specifically the tables of size based on fpp/ndv)? Or is this
> > better as a third-party resource / exercise for the reader?
> >
> > Andrew
> >
> > [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> > [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
> >
>
