I think the table is useful. There are calculators online that do this
pretty easily, but putting it into the docs might help at least some
users avoid unpleasant surprises. As for generalizing to smaller NDV
counts and their effectiveness, we might just want to state the result
but add strong caveats that users should benchmark with their own data.
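
For reference, here is a rough sketch of the kind of calculation such a
table would tabulate. It assumes the sizing formula commonly used for
split block Bloom filters, bits = -8 * ndv / ln(1 - fpp^(1/8)); the
function names are hypothetical, and real writers may round or clamp the
result, so treat the numbers as estimates rather than what any particular
implementation will allocate.

```python
import math


def sbbf_size_bytes(ndv: int, fpp: float) -> int:
    """Estimate filter size in bytes for a target ndv and fpp.

    Assumption: bits = -8 * ndv / ln(1 - fpp ** (1/8)). No rounding to a
    power of two and no min/max clamping is applied here.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    return math.ceil(bits / 8.0)


def sbbf_expected_fpp(ndv: int, size_bytes: int) -> float:
    """Invert the same formula: theoretical fpp for a fixed byte budget."""
    if ndv <= 0 or size_bytes <= 0:
        raise ValueError("ndv and size_bytes must be positive")
    bits = size_bytes * 8.0
    return (1.0 - math.exp(-8.0 * ndv / bits)) ** 8


if __name__ == "__main__":
    # Print a small size table (bytes) over a few ndv / fpp combinations.
    fpps = (0.1, 0.01, 0.001)
    print("ndv".rjust(10), *(f"fpp={p}".rjust(12) for p in fpps))
    for ndv in (1_000, 10_000, 100_000, 1_000_000):
        sizes = (str(sbbf_size_bytes(ndv, p)).rjust(12) for p in fpps)
        print(str(ndv).rjust(10), *sizes)
```

The second function goes the other way: given a fixed budget such as a
few KB per column, it gives the theoretical fpp for a given ndv. That is
only the theoretical figure; how much pruning it buys on real queries is
exactly what needs to be benchmarked.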

Cheers,
Micah


On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

> When we first looked into Parquet bloom filters[1] it was hard to
> understand how effective they would be for a given amount of space
> overhead.
>
> When we plugged our data's cardinality into the target ndv and fpp
> parameters, it implied 2MB bloom filters *per column* per row group which
> was unacceptable.
>
> However, when we did empirical tests (see Blog[2] from Trevor Hilton), we
> found 2K-8K worked quite well.
>
> Is there any interest in porting some of the information from the blog into
> the spec (specifically the tables of size based on fpp/ndv)? Or is this
> better as a third-party resource / exercise for the reader?
>
> Andrew
>
> [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
>
