I created this PR: https://github.com/apache/iceberg/pull/14244
When you have the time, please review it.
Best regards,
André Rosa

On Wed, Oct 1, 2025 at 10:31 AM André Rosa <[email protected]> wrote:

> Hello,
> I'll do it. Was just waiting for more replies in this thread and replies
> in the parquet-dev mailing list regarding the default behavior.
>
> Best regards,
> André Rosa
>
>
> On Wed, Oct 1, 2025 at 12:01 AM huaxin gao <[email protected]> wrote:
>
>> Hi Andre,
>> Do you want to open a PR to add
>> write.parquet.bloom-filter-ndv.column.<col> to configure NDV? I am happy
>> to do it too.
>>
>> Thanks,
>> Huaxin
>>
>> On Wed, Sep 17, 2025 at 10:02 AM André Rosa <[email protected]>
>> wrote:
>>
>>> Hi Huaxin,
>>> I'll start a new thread on parquet-dev.
>>> Thank you,
>>> André Rosa
>>>
>>> On Wed, Sep 17, 2025 at 5:39 PM huaxin gao <[email protected]>
>>> wrote:
>>>
>>>> Thanks André for raising this!
>>>> +1 to adding write.parquet.bloom-filter-ndv.column.<col> to configure
>>>> NDV. For the “FPP without NDV” case, let’s defer to the Parquet community
>>>> (error vs ignore vs default NDV); Iceberg will follow their decision. Would
>>>> you like to start a thread on parquet-dev, or I’m happy to do it?
>>>>
>>>> Thanks,
>>>> Huaxin
>>>>
>>>> On Wed, Sep 17, 2025 at 3:46 AM André Rosa
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello everyone,
>>>>> while working on a parquet writer, I found an issue related to the
>>>>> bloom filter table properties.
>>>>>
>>>>> Currently, the iceberg specification
>>>>> <https://iceberg.apache.org/docs/latest/configuration/#write-properties>
>>>>> defines 3 table properties for configuring bloom filters:
>>>>>
>>>>> write.parquet.bloom-filter-enabled.column.col1
>>>>>   Default: (not set)
>>>>>   Hint to parquet to write a bloom filter for the column: 'col1'
>>>>>
>>>>> write.parquet.bloom-filter-max-bytes
>>>>>   Default: 1048576 (1 MB)
>>>>>   The maximum number of bytes for a bloom filter bitset
>>>>>
>>>>> write.parquet.bloom-filter-fpp.column.col1
>>>>>   Default: 0.01
>>>>>   The false positive probability for a bloom filter applied to 'col1'
>>>>>   (must > 0.0 and < 1.0)
>>>>>
>>>>> Looking at the parquet-java implementation
>>>>> <https://github.com/apache/parquet-java/blob/36a5f9cf8c1ce2c19631a0ec376665c5e41ea215/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L179-L192>,
>>>>> the fpp value for a given column is ignored if the ndv for that column is
>>>>> not specified.
>>>>>
>>>>> Being that the iceberg spec does not define a property for this and
>>>>> that there is no default, the implementation always ignores the fpp
>>>>> property and uses the bloom-filter-max-bytes as the exact size instead
>>>>> <https://github.com/apache/parquet-java/blob/299b0aea128645312badc329479920ddf8736577/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L205-L217>
>>>>> (if the bloom filter is enabled for the column).
>>>>>
>>>>>
>>>>> My proposal is to define a new table property
>>>>> 'write.parquet.bloom-filter-ndv.column.col1' in the spec to enable
>>>>> configuring the ndv to use.
>>>>>
>>>>> In addition, it also should be discussed if not specifying the ndv but
>>>>> specifying the fpp should be a config "error" (or simply ignored like
>>>>> parquet-java is doing) or if it should use a default ndv instead.
>>>>>
>>>>> What do you think should be done regarding this?
>>>>>
>>>>> Best regards,
>>>>> André Rosa
>>>>>
>>>>
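
For readers following the thread, the sizing behavior described in the original message can be sketched roughly as below. This is an illustration only, not parquet-java code: the function name bloom_filter_bytes and its parameters are hypothetical, the formula is the usual split-block approximation fpp ~ (1 - e^(-8n/m))^8 (assuming 8 bits set per value), and the rounding and lower-bound handling that a real implementation applies are left out.

```python
import math


def bloom_filter_bytes(max_bytes=1048576, fpp=0.01, ndv=None):
    """Hypothetical sketch of how a writer might pick a bloom filter size.

    max_bytes, fpp, and ndv stand in for the table properties discussed
    in the thread; the names are illustrative, not a real API.
    """
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be > 0.0 and < 1.0")

    # Without an ndv there is nothing to size against, so fpp has no
    # effect and max_bytes is used as the exact size (the behavior the
    # thread reports).
    if ndv is None:
        return max_bytes

    # Assumed approximation: solve fpp = (1 - e^(-8n/m))^8 for m (bits).
    bits = -8 * ndv / math.log(1 - fpp ** (1 / 8))

    # Cap at max_bytes; real implementations also round and apply a
    # lower bound, which this sketch omits.
    return min(max_bytes, math.ceil(bits / 8))


if __name__ == "__main__":
    print(bloom_filter_bytes())             # no ndv: fpp is ignored
    print(bloom_filter_bytes(ndv=1000))     # ndv given: sized from fpp
    print(bloom_filter_bytes(ndv=10**9))    # very large ndv: capped
```

With an ndv property available, a small column would get a filter far below the 1 MB default, which is the motivation for the proposed 'write.parquet.bloom-filter-ndv.column.col1' property.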
