Addendum:

I should mention the elephant in the room: How do we know the (true) NDV
count when writing a Parquet row group?

This is something the parquet-java implementation currently doesn't do well,
and it might also be worth mentioning in the spec: if I read the code
correctly, parquet-java doesn't know the NDV count when writing a Parquet
file. That is bad. Since it passes over the data anyway, it could simply
estimate the NDV count as it writes.

Our proprietary Parquet writer uses a HyperLogLog sketch to estimate the
NDV count to within a few percent of uncertainty (and an error of a few
percent is irrelevant for bloom filter sizing). This way, we can always
write a correctly sized bloom filter for each row group, because we know the
true NDV count quite well; the user only has to specify the desired FPP and
the writer figures out the rest.
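
To make this concrete, here is a minimal sketch of the approach (not our
actual writer code; it assumes the Apache DataSketches HllSketch API for the
HLL part and uses the textbook sizing formula m = -n*ln(p)/(ln 2)^2, whereas a
writer targeting Parquet's split block filter would plug the estimate into the
spec's k = 8 formula instead; the class and method names are made up for the
example):

import org.apache.datasketches.hll.HllSketch;

// Hypothetical helper that a writer could keep per column chunk.
class NdvBasedBloomSizer {

  // lgK = 12 -> 4096 buckets, roughly 1-2% relative error on the estimate
  private final HllSketch sketch = new HllSketch(12);

  // Feed the hash of every written value through the sketch.
  void observe(long valueHash) {
    sketch.update(valueHash);
  }

  // Once the row group is complete, size the bloom filter from the estimated
  // NDV and the user-supplied FPP.
  long bloomFilterBytes(double desiredFpp) {
    double estimatedNdv = sketch.getEstimate();
    double bits = -estimatedNdv * Math.log(desiredFpp) / (Math.log(2) * Math.log(2));
    return (long) Math.ceil(bits / 8.0);
  }

  public static void main(String[] args) {
    NdvBasedBloomSizer sizer = new NdvBasedBloomSizer();
    for (long i = 0; i < 10_000; i++) {
      sizer.observe(i * 0x9E3779B97F4A7C15L); // stand-in for real value hashes
    }
    // ~10'000 estimated NDV at 1% FPP -> roughly 12'000 bytes
    System.out.println(sizer.bloomFilterBytes(0.01) + " bytes");
  }
}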

Is this maybe the reason why the blog post talks about the NDV being
tunable? Maybe that's the misunderstanding I have.

So I guess we should add a note to the spec saying that writers should try
to estimate (or even compute exactly, though that is expensive) the NDV
count in order to size the bloom filter correctly.

Cheers,
Jan

On Mon, Jun 3, 2024 at 12:53 PM Jan Finis <jpfi...@gmail.com> wrote:

> Thanks for the suggestion, Andrew!
>
> I'm sceptical about the conclusion in the blog post TBH, as the
> measurements just seem to defy bloom filter theory, or maybe I do not
> understand the blog post correctly. I think it might be a problem with the
> benchmark setup (see below for details).
>
> The blog post rightfully claims that bloom filters can get pretty big (2MB
> for 1M NDV) if your NDV count (per row group!) is high and you still want a
> reasonable false positive probability (FPP; say, 1%), but that is just the
> way bloom filters are. The FPP formula is theoretically proven.
>
> But now the blog post argues that you can still get pretty good results
> even by using a bloom filter that is orders of magnitude smaller than that.
> That is in stark contradiction with the FPP formula. The blog post claims
> that just using 8KB instead of 2MB works just fine as well. But that
> doesn't make sense, theory-wise, unless the FPP formula is wrong (which it
> isn't; it's proven). So we have quite a conundrum here; or maybe I
> misunderstand the claims of the blog post.
>
> Let me tease apart how I understand the blog post and where I disagree.
> Maybe we find where the misunderstanding lies:
>
> Side comment: First, the blog post talks about "varying the NDV
> parameter". However, NDV isn't a parameter that one can vary freely. It is
> supposed to be the *true* NDV count in the data set. That is, for Parquet
> it should be the true NDV count in the row group for which the bloom filter
> is built. Nothing to vary here.
>
>
> The only parameter that *can* be tuned for a given data set is the FPP,
> as NDV is an inherent property of the data set, not a tuning knob. By
> "varying" the NDV as the blog post does, it in reality actually just varies
> the FPP. You no longer should get the desired FPP, if you just input a
> wrong NDV count into the sizing formula. If you want a smaller bloom
> filter, you should raise the FPP. But that's just theory; in the end, you
> pick a size for your bloom filter and it doesn't matter what parameter you
> changed in your sizing formula to arrive at that value. It just irked me
> that a non-variable parameter is varied.
>
>
> Now, if I understand the claims of the blog post correctly, the blog post
> basically says that if you underestimate the NDV count heavily, you still
> get great pruning results. And this just contradicts the FPP formula. As
> just elaborated, by underestimating the NDV count, you basically should get
> a higher FPP, which should reduce the number of row groups you can prune.
> But the measurement seems to hint that you can still prune perfectly. So
> basically, the blog post empirically finds that the FPP formula for bloom
> filters is wrong (or do I misunderstand the blog post here?).
>
> *Now comes the important point and I think here is where the problem lies:*
> The blog post doesn't say by how much the NDV is underestimated though. It
> mentions the true NDV count of the whole data set (called cardinality in
> the blog post), but it does not mention what that means for the true NDV
> count *per row group*. The latter is dependent on the sorting of the
> data. Is it sorted? Or is it random? In the former case, you will have way
> fewer NDVs per row group than in the latter case (around 100x fewer in the
> 1M cardinality case).
>
> *Let's assume the worst case *(close to random ordering of rows): For the
> 1M cardinality example, we would really also get a true 1M NDV count in
> each row group, so basically only unique values. In this case, per the FPP
> formula, if you only use an 8kb bloom filter, you should get the following
> FPP:
>
>     fpp = (1 - e^(-k*n/m))^k
>
> With k = 8, n = 1'000'000, m = 64'000 (8kb in bits), we get basically
> 1.000, so 100% false positive probability. And that also matches
> intuition: each distinct value sets 8 bits, and since you only have 64'000
> bits to set but are performing 8 million bit-setting operations, each bit
> ends up being 1 with almost certain probability, so you end up with a
> useless bloom filter that is all 1 bits.
>
> *Let's assume the best case:* If the values are sorted and we have 1
> million NDV count for 100 million values, that means we have 10'000 NDV
> count in each row group (each value duplicated 100 times). For this case,
> the expected FPP with an 8kb bloom filter is 6.4% (k = 8, n = 10'000,
> m = 64'000).
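>
> (For anyone who wants to reproduce both numbers, a tiny self-contained Java
> check like the following is enough; it only evaluates the FPP formula above,
> nothing Parquet-specific, and the class name is just made up for the example:)
>
> class FppCheck {
>   // FPP approximation of a bloom filter: (1 - e^(-k*n/m))^k
>   static double fpp(double k, double n, double m) {
>     return Math.pow(1.0 - Math.exp(-k * n / m), k);
>   }
>
>   public static void main(String[] args) {
>     // worst case: 1'000'000 distinct values in a 64'000 bit (8kb) filter -> ~1.0, useless
>     System.out.println(fpp(8, 1_000_000, 64_000));
>     // best case: 10'000 distinct values in the same filter -> a few percent
>     System.out.println(fpp(8, 10_000, 64_000));
>   }
> }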
>
> For this case, the bloom filter is definitely usable. However, here the
> default Parquet sizing formula would also not have yielded 2MB, but rather
> about 12kb, so if this is the case your benchmark measures, then your
> values are pretty much in agreement with the Parquet sizing formula and
> there is just no discrepancy.
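>
> (And a similar check for the sizing direction, using the fixed-k formula
> m = -8n / ln(1 - p^(1/8)) that, as I read it, the bloom filter spec and
> parquet-java use; just a sketch to show where the ~12kb figure comes from:)
>
> class SizingCheck {
>   // bits needed for n distinct values at false positive probability p, with k fixed at 8
>   static double bitsNeeded(double n, double p) {
>     return -8.0 * n / Math.log(1.0 - Math.pow(p, 1.0 / 8.0));
>   }
>
>   public static void main(String[] args) {
>     // n = 10'000, p = 1% -> roughly 97'000 bits, i.e. about 12 kilobytes
>     System.out.println(bitsNeeded(10_000, 0.01) / 8 / 1000 + " kb");
>   }
> }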
>
> I guess your experiment actually measures the best case, as then your
> numbers are in perfect sync with theory: with 8kb (6.4% FPP), over 90% of
> row groups are filtered, while with 16kb (0.2% FPP) all row groups are
> correctly pruned. But this result just confirms that the sizing formula
> that Parquet uses is correct.
>
> So could it be that the blog post accidentally measures an NDV count of
> 10'000, while giving the impression that we're dealing with an NDV count of
> 1'000'000 in the 1'000'000 cardinality case?
>
> Cheers
> Jan
>
>
> On Sat, Jun 1, 2024 at 9:37 AM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
>
>> I think the table is useful; there are calculators online that do
>> this pretty easily, but putting it into the docs might allow at least some
>> users to avoid unpleasant surprises.  In terms of generalizing to smaller
>> NDV counts and their effectiveness, we might just want to state the result
>> but provide strong caveats to benchmark with their own data?
>>
>> Cheers,
>> Micah
>>
>>
>> On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <andrewlam...@gmail.com>
>> wrote:
>>
>> > When we first looked into Parquet bloom filters[1] it was hard to
>> > understand how effective they would be for a given amount of space
>> > overhead.
>> >
>> > When we plugged our data's cardinality into the target ndv and fpp
>> > parameters, it implied 2MB bloom filters *per column* per row group,
>> > which was unacceptable.
>> >
>> > However, when we did empirical tests (see Blog[2] from Trevor Hilton), we
>> > found 2K-8K worked quite well.
>> >
>> > Is there any interest in porting some of the information from the blog into
>> > the spec (specifically the tables of size based on fpp/ndv)? Or is this
>> > better as a third-party resource / exercise for the reader?
>> >
>> > Andrew
>> >
>> > [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
>> > [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
>> >
>>
>
