Hello everyone,
while working on a parquet writer, I found an issue related to the bloom
filter table properties.

Currently, the Iceberg specification
<https://iceberg.apache.org/docs/latest/configuration/#write-properties>
defines three table properties for configuring bloom filters:

write.parquet.bloom-filter-enabled.column.col1
    Default: (not set)
    Hint to parquet to write a bloom filter for the column: 'col1'

write.parquet.bloom-filter-max-bytes
    Default: 1048576 (1 MB)
    The maximum number of bytes for a bloom filter bitset

write.parquet.bloom-filter-fpp.column.col1
    Default: 0.01
    The false positive probability for a bloom filter applied to 'col1'
    (must be > 0.0 and < 1.0)
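
For context, these are set like any other table property; a minimal
sketch using the Iceberg Java API (how the Table instance is obtained
is elided, and 'col1' is just a placeholder column name):

    import org.apache.iceberg.Table;

    // Enable a bloom filter for 'col1' and tune its target fpp.
    void configureBloomFilter(Table table) {
        table.updateProperties()
            .set("write.parquet.bloom-filter-enabled.column.col1", "true")
            .set("write.parquet.bloom-filter-fpp.column.col1", "0.01")
            .commit();
    }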

Looking at the parquet-java implementation
<https://github.com/apache/parquet-java/blob/36a5f9cf8c1ce2c19631a0ec376665c5e41ea215/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L179-L192>,
the fpp value for a given column is ignored if the ndv (number of
distinct values) for that column is not specified.
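
In other words, the sizing decision boils down to something like the
following (a simplified sketch of the linked code, not the actual
parquet-java API; the method name and structure are illustrative):

    import java.util.OptionalLong;

    // Sketch: fpp only takes effect when an ndv is available for the column.
    static long bloomFilterNumBits(OptionalLong ndv, double fpp, int maxBytes) {
        if (ndv.isPresent()) {
            // Classic bloom filter sizing: m = -n * ln(p) / (ln 2)^2
            double bits = -ndv.getAsLong() * Math.log(fpp)
                / (Math.log(2) * Math.log(2));
            return Math.min((long) bits, maxBytes * 8L);
        }
        // No ndv: fpp is ignored, the maximum size becomes the exact size.
        return maxBytes * 8L;
    }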

Since the Iceberg spec does not define a property for the ndv and
parquet-java has no default for it, the fpp property is always ignored
in practice, and bloom-filter-max-bytes is used as the exact bitset
size instead
<https://github.com/apache/parquet-java/blob/299b0aea128645312badc329479920ddf8736577/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L205-L217>
(if the bloom filter is enabled for the column).
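
To illustrate the practical effect (numbers are hypothetical): a column
with 100,000 distinct values at the default fpp of 0.01 only needs
about 117 KiB by the standard sizing formula, yet without an ndv the
full 1 MiB bitset is written:

    public class BloomSizeExample {
        public static void main(String[] args) {
            long ndv = 100_000L; // hypothetical distinct-value count
            double fpp = 0.01;   // the default fpp from the docs above
            // Standard sizing: m = -n * ln(p) / (ln 2)^2
            double bits = -ndv * Math.log(fpp) / (Math.log(2) * Math.log(2));
            System.out.printf(
                "needed for fpp=0.01: ~%.0f KiB (vs. 1024 KiB written today)%n",
                bits / 8 / 1024);
        }
    }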


My proposal is to define a new table property
'write.parquet.bloom-filter-ndv.column.col1' in the spec so that the
ndv can be configured per column.
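
With such a property, the configuration could look like this
(hypothetical, since the property does not exist yet):

    table.updateProperties()
        .set("write.parquet.bloom-filter-enabled.column.col1", "true")
        .set("write.parquet.bloom-filter-fpp.column.col1", "0.01")
        .set("write.parquet.bloom-filter-ndv.column.col1", "100000") // proposed
        .commit();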

In addition, we should discuss whether specifying the fpp without an
ndv should be treated as a configuration "error", be silently ignored
(as parquet-java does today), or fall back to a default ndv.

What do you think should be done regarding this?

Best regards,
André Rosa
