When we first looked into Parquet bloom filters[1] it was hard to
understand how effective they would be for a given amount of space
overhead.

When we plugged our data's cardinality into the target ndv and fpp
parameters, the formula implied 2MB bloom filters *per column* per row
group, which was unacceptable.
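For anyone who wants to reproduce that arithmetic: the sketch below uses
the classic bloom-filter sizing formula m = -k*n / ln(1 - fpp^(1/k)) with
k = 8 hash functions, which approximates the sizing guidance for the
split-block filters Parquet uses. The function name and defaults are my
own, not from the spec.

```python
import math

def bloom_size_bytes(ndv: int, fpp: float, k: int = 8) -> int:
    """Approximate bytes needed for a bloom filter with k hash functions,
    sized for ndv distinct values at false-positive probability fpp,
    via m = -k*n / ln(1 - fpp**(1/k))."""
    bits = -k * ndv / math.log(1 - fpp ** (1 / k))
    return math.ceil(bits / 8)

# One million distinct values at a 1% false-positive rate comes out to
# roughly 1.2 MB under this formula, so a few million ndv per column
# lands in the ~2MB range described above.
print(bloom_size_bytes(1_000_000, 0.01))
```

This is only the textbook approximation; the actual on-disk size is also
rounded by the implementation, but it shows why a naive ndv/fpp target
explodes at high cardinality.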

However, when we ran empirical tests (see the blog post[2] from Trevor
Hilton), we found that much smaller filters of 2K-8K worked quite well.

Is there any interest in porting some of the information from the blog into
the spec (specifically the tables of filter size as a function of fpp/ndv)?
Or is this better left as a third-party resource / exercise for the reader?

Andrew

[1]: https://parquet.apache.org/docs/file-format/bloomfilter/
[2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
