Hello everyone,

While working on a Parquet writer, I ran into an issue with the bloom filter table properties.
Currently, the Iceberg specification <https://iceberg.apache.org/docs/latest/configuration/#write-properties> defines three table properties for configuring bloom filters:

- write.parquet.bloom-filter-enabled.column.col1 (default: not set): hint to Parquet to write a bloom filter for the column 'col1'
- write.parquet.bloom-filter-max-bytes (default: 1048576, i.e. 1 MB): the maximum number of bytes for a bloom filter bitset
- write.parquet.bloom-filter-fpp.column.col1 (default: 0.01): the false positive probability for a bloom filter applied to 'col1' (must be > 0.0 and < 1.0)

Looking at the parquet-java implementation <https://github.com/apache/parquet-java/blob/36a5f9cf8c1ce2c19631a0ec376665c5e41ea215/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L179-L192>, the fpp value for a given column is ignored if the ndv (number of distinct values) for that column is not specified. Since the Iceberg spec defines no property for the ndv and there is no default, the implementation always ignores the fpp property and uses bloom-filter-max-bytes as the exact bitset size instead <https://github.com/apache/parquet-java/blob/299b0aea128645312badc329479920ddf8736577/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L205-L217> (if the bloom filter is enabled for the column). A rough sketch of this sizing logic is in the P.S. below.

My proposal is to define a new table property, 'write.parquet.bloom-filter-ndv.column.col1', in the spec so that the ndv can be configured. In addition, we should discuss whether specifying the fpp without the ndv should be treated as a configuration error (rather than being silently ignored, as parquet-java does today), or whether a default ndv should be used instead.

What do you think should be done about this?

Best regards,
André Rosa
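
P.S. For concreteness, here is a minimal sketch of the sizing behaviour I described above, paraphrased in plain Java. The class, method name, and signature are mine for illustration, not the actual parquet-java API:

    import java.util.OptionalDouble;
    import java.util.OptionalLong;

    class BloomFilterSizing {
        // Sketch of the per-column decision as I read the parquet-java code.
        static int bitsetBytes(OptionalLong ndv, OptionalDouble fpp, int maxBytes) {
            if (ndv.isPresent() && fpp.isPresent()) {
                // Standard bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
                // (If I read it correctly, the real implementation additionally
                // rounds this up to a power of two.)
                double bits = -ndv.getAsLong() * Math.log(fpp.getAsDouble())
                        / (Math.log(2) * Math.log(2));
                int bytes = (int) Math.ceil(bits / 8.0);
                return Math.min(bytes, maxBytes); // the cap still applies
            }
            // With no ndv there is nothing to plug into the formula, so the
            // cap itself ends up being used as the exact bitset size, which
            // is the behaviour this email is about.
            return maxBytes;
        }
    }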
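
And here is how the proposed property would sit next to the existing ones when set through the Iceberg Java API. Note that the ndv property is the one I am proposing; it does not exist in the spec today, and the value 1000000 is just an example:

    import org.apache.iceberg.Table;

    class Example {
        // 'table' would be loaded from a catalog in real code.
        static void configureBloomFilter(Table table) {
            table.updateProperties()
                .set("write.parquet.bloom-filter-enabled.column.col1", "true")
                .set("write.parquet.bloom-filter-fpp.column.col1", "0.01")
                // Proposed property, not yet in the spec:
                .set("write.parquet.bloom-filter-ndv.column.col1", "1000000")
                .commit();
        }
    }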