When we first looked into Parquet bloom filters[1] it was hard to understand how effective they would be for a given amount of space overhead.
When we plugged our data's cardinality into the target ndv and fpp parameters, the formula implied 2 MB bloom filters *per column* per row group, which was unacceptable. However, when we ran empirical tests (see the blog post[2] by Trevor Hilton), we found that filters of 2-8 KB worked quite well.

Is there any interest in porting some of the information from the blog into the spec (specifically the tables of filter size as a function of fpp/ndv)? Or is this better left as a third-party resource / exercise for the reader?

Andrew

[1]: https://parquet.apache.org/docs/file-format/bloomfilter/
[2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
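For context, the sizing arithmetic above comes from the textbook bloom filter formula, m = -n * ln(p) / (ln 2)^2 bits for n distinct values at false positive rate p. Here is a minimal sketch of that calculation; the ndv/fpp values are illustrative only, not our actual data, and Parquet's split block bloom filters may round these sizes differently in practice:

```python
import math

def bloom_filter_bytes(ndv: int, fpp: float) -> int:
    """Classic bloom filter size estimate:
    m = -n * ln(p) / (ln 2)^2 bits, converted to bytes."""
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

# Illustrative: one million distinct values at a 1% false
# positive rate needs roughly 1.2 MB per filter, so it is
# easy to see how high-cardinality columns reach the MB range.
print(bloom_filter_bytes(1_000_000, 0.01))
```

This is why accepting a higher fpp (or using the observed ndv per row group rather than the whole-file cardinality) shrinks the filter so dramatically in the empirical tests.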