Hi Huaxin,

I'll start a new thread on parquet-dev.

Thank you,
André Rosa

On Wed, Sep 17, 2025 at 5:39 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
> Thanks André for raising this!
> +1 to adding write.parquet.bloom-filter-ndv.column.<col> to configure
> NDV. For the “FPP without NDV” case, let’s defer to the Parquet community
> (error vs ignore vs default NDV); Iceberg will follow their decision. Would
> you like to start a thread on parquet-dev, or I’m happy to do it?
>
> Thanks,
> Huaxin
>
> On Wed, Sep 17, 2025 at 3:46 AM André Rosa <andre.r...@dremio.com.invalid>
> wrote:
>
>> Hello everyone,
>> while working on a parquet writer, I found an issue related to the bloom
>> filter table properties.
>>
>> Currently, the iceberg specification
>> <https://iceberg.apache.org/docs/latest/configuration/#write-properties>
>> defines 3 table properties for configuring bloom filters:
>>
>> write.parquet.bloom-filter-enabled.column.col1
>>   Default: (not set)
>>   Hint to parquet to write a bloom filter for the column: 'col1'
>>
>> write.parquet.bloom-filter-max-bytes
>>   Default: 1048576 (1 MB)
>>   The maximum number of bytes for a bloom filter bitset
>>
>> write.parquet.bloom-filter-fpp.column.col1
>>   Default: 0.01
>>   The false positive probability for a bloom filter applied to 'col1'
>>   (must > 0.0 and < 1.0)
>>
>> Looking at the parquet-java implementation
>> <https://github.com/apache/parquet-java/blob/36a5f9cf8c1ce2c19631a0ec376665c5e41ea215/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L179-L192>,
>> the fpp value for a given column is ignored if the ndv for that column is
>> not specified.
>>
>> Being that the iceberg spec does not define a property for this and that
>> there is no default, the implementation always ignores the fpp property and
>> uses the bloom-filter-max-bytes as the exact size instead
>> <https://github.com/apache/parquet-java/blob/299b0aea128645312badc329479920ddf8736577/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L205-L217>
>> (if the bloom filter is enabled for the column).
>>
>> My proposal is to define a new table property
>> 'write.parquet.bloom-filter-ndv.column.col1' in the spec to enable
>> configuring the ndv to use.
>>
>> In addition, it also should be discussed if not specifying the ndv but
>> specifying the fpp should be a config "error" (or simply ignored like
>> parquet-java is doing) or if it should use a default ndv instead.
>>
>> What do you think should be done regarding this?
>>
>> Best regards,
>> André Rosa
>>
>
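
To make the sizing question in the quoted message concrete, here is a minimal sketch of how fpp and ndv could interact. It is not parquet-java's code. The function name `bloom_filter_bytes` and its parameters are hypothetical, the formula is the standard bloom filter sizing with k = 8 hash functions (the number of bits a split-block filter sets per value), and rounding up to whole 32-byte blocks is a simplification that may differ from what parquet-java actually does.

```python
import math

DEFAULT_MAX_BYTES = 1048576  # default of write.parquet.bloom-filter-max-bytes (1 MB)
BLOCK_BYTES = 32             # one split-block bloom filter block is 256 bits


def bloom_filter_bytes(ndv=None, fpp=0.01, max_bytes=DEFAULT_MAX_BYTES):
    """Hypothetical helper: pick a bloom filter bitset size in bytes.

    Mirrors the behaviour described in the thread: without an ndv the fpp
    cannot be applied, so max_bytes is used as the exact size.
    """
    if ndv is None:
        # No ndv configured: fpp is ignored and the filter takes max_bytes.
        return max_bytes
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be > 0.0 and < 1.0")

    # Solve fpp = (1 - exp(-8 * ndv / m)) ** 8 for m, the number of bits.
    bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))

    # Round up to whole blocks (simplified), then cap at max_bytes.
    blocks = max(math.ceil(bits / (8 * BLOCK_BYTES)), 1)
    return min(blocks * BLOCK_BYTES, max_bytes)


if __name__ == "__main__":
    print(bloom_filter_bytes(fpp=0.01))               # fpp alone: falls back to max_bytes
    print(bloom_filter_bytes(ndv=100_000, fpp=0.01))  # sized from ndv and fpp
```

Under these assumptions, a column with ndv = 100,000 and fpp = 0.01 would need roughly 120 KB, while the same column with no ndv gets the full 1 MB regardless of fpp. That gap is what the proposed 'write.parquet.bloom-filter-ndv.column.col1' property would let a writer close.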