I think the table is useful. There are calculators online that do this pretty easily, but putting it into the docs might help at least some users avoid unpleasant surprises. As for generalizing to smaller NDV counts and their effectiveness, we might just want to state the result but add strong caveats that users should benchmark with their own data.
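For a rough sense of what such a table could look like, here is a minimal sketch that uses the textbook Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits. This is an assumption on my part and not the formula from the Parquet spec: Parquet uses a split block Bloom filter, which likely needs somewhat more space for the same fpp, and writers may round the size up. The function name and the ndv/fpp values are made up for illustration.

```python
import math


def classic_bloom_bytes(ndv: int, fpp: float) -> int:
    """Approximate size in bytes of a classic Bloom filter.

    Uses the textbook formula m = -n * ln(p) / (ln 2)^2 bits, which
    assumes an optimal number of hash functions. This is NOT the
    Parquet split block formula; treat the result as a rough estimate.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


if __name__ == "__main__":
    # Illustrative values only; substitute your own cardinalities.
    fpps = (0.1, 0.01, 0.001)
    print("ndv".rjust(10) + "".join(f"fpp={p}".rjust(14) for p in fpps))
    for ndv in (1_000, 10_000, 100_000, 1_000_000):
        sizes = "".join(f"{classic_bloom_bytes(ndv, p):,}".rjust(14) for p in fpps)
        print(f"{ndv:,}".rjust(10) + sizes)
```

The sizes grow linearly with ndv and only logarithmically with 1/fpp, so the target ndv is by far the dominant knob. None of this says how well a smaller filter works on real data, so it does not replace benchmarking.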
Cheers,
Micah

On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

> When we first looked into Parquet bloom filters[1] it was hard to
> understand how effective they would be for a given amount of space
> overhead.
>
> When we plugged our data's cardinality into the target ndv and fpp
> parameters, it implied 2MB bloom filters *per column* per row group which
> was unacceptable.
>
> However, when we did empirical tests (see Blog[2] from Trevor Hilton), we
> found 2K-8K worked quite well.
>
> Is there any interest in porting some of the information from the blog into
> the spec (specifically the tables of size based of fpp/ndv)? Or is this
> better as a third-party resource / exercise for the reader?
>
> Andrew
>
> [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/