I created this PR: https://github.com/apache/iceberg/pull/14244
When you have the time, please review it.

Best regards,
André Rosa

On Wed, Oct 1, 2025 at 10:31 AM André Rosa <[email protected]> wrote:

> Hello,
> I'll do it. I was just waiting for more replies in this thread and on the
> parquet-dev mailing list regarding the default behavior.
>
> Best regards,
> André Rosa
>
>
> On Wed, Oct 1, 2025 at 12:01 AM huaxin gao <[email protected]> wrote:
>
>> Hi Andre,
>> Do you want to open a PR to add
>> write.parquet.bloom-filter-ndv.column.<col> to configure NDV? I am happy
>> to do it too.
>>
>> Thanks,
>> Huaxin
>>
>> On Wed, Sep 17, 2025 at 10:02 AM André Rosa <[email protected]>
>> wrote:
>>
>>> Hi Huaxin,
>>> I'll start a new thread on parquet-dev.
>>> Thank you,
>>> André Rosa
>>>
>>> On Wed, Sep 17, 2025 at 5:39 PM huaxin gao <[email protected]>
>>> wrote:
>>>
>>>> Thanks André for raising this!
>>>> +1 to adding write.parquet.bloom-filter-ndv.column.<col> to configure
>>>> NDV. For the “FPP without NDV” case, let’s defer to the Parquet community
>>>> (error vs ignore vs default NDV); Iceberg will follow their decision. Would
>>>> you like to start a thread on parquet-dev? Otherwise, I’m happy to do it.
>>>>
>>>> Thanks,
>>>> Huaxin
>>>>
>>>> On Wed, Sep 17, 2025 at 3:46 AM André Rosa
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello everyone,
>>>>> while working on a parquet writer, I found an issue related to the
>>>>> bloom filter table properties.
>>>>>
>>>>> Currently, the iceberg specification
>>>>> <https://iceberg.apache.org/docs/latest/configuration/#write-properties>
>>>>> defines 3 table properties for configuring bloom filters:
>>>>>
>>>>> Property:    write.parquet.bloom-filter-enabled.column.col1
>>>>> Default:     (not set)
>>>>> Description: Hint to parquet to write a bloom filter for the column: 'col1'
>>>>>
>>>>> Property:    write.parquet.bloom-filter-max-bytes
>>>>> Default:     1048576 (1 MB)
>>>>> Description: The maximum number of bytes for a bloom filter bitset
>>>>>
>>>>> Property:    write.parquet.bloom-filter-fpp.column.col1
>>>>> Default:     0.01
>>>>> Description: The false positive probability for a bloom filter applied
>>>>>              to 'col1' (must > 0.0 and < 1.0)
>>>>>
>>>>> Looking at the parquet-java implementation
>>>>> <https://github.com/apache/parquet-java/blob/36a5f9cf8c1ce2c19631a0ec376665c5e41ea215/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L179-L192>,
>>>>> the fpp value for a given column is ignored if the ndv for that column is
>>>>> not specified.
>>>>>
>>>>> Because the iceberg spec does not define a property for the ndv and
>>>>> there is no default for it, the implementation always ignores the fpp
>>>>> property and uses the bloom-filter-max-bytes as the exact size instead
>>>>> <https://github.com/apache/parquet-java/blob/299b0aea128645312badc329479920ddf8736577/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L205-L217>
>>>>> (if the bloom filter is enabled for the column).
>>>>>
>>>>>
>>>>> My proposal is to define a new table property
>>>>> 'write.parquet.bloom-filter-ndv.column.col1' in the spec to enable
>>>>> configuring the ndv to use.
>>>>>
>>>>> In addition, we should discuss whether specifying the fpp without the
>>>>> ndv should be a config "error", be silently ignored (as parquet-java
>>>>> does today), or fall back to a default ndv.
>>>>>
>>>>> What do you think should be done regarding this?
>>>>>
>>>>> Best regards,
>>>>> André Rosa
>>>>>
>>>>
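
For anyone skimming the thread, here is a rough Python sketch of the sizing behaviour described in the original message. It is not the parquet-java code. The function names, the 32-byte minimum, and the power-of-two rounding are assumptions made for illustration, and the bits formula is the usual split-block bloom filter approximation as I understand it. The only point is to show why the fpp has no effect when no ndv is given.

```python
import math
from typing import Optional


def estimated_bits(ndv: int, fpp: float) -> float:
    """Approximate bit count for a split-block bloom filter (assumed formula)."""
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be > 0.0 and < 1.0")
    return -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))


def bloom_filter_bytes(ndv: Optional[int], fpp: float, max_bytes: int) -> int:
    """Hypothetical helper: pick a bloom filter size in bytes."""
    if ndv is None:
        # No ndv configured: the fpp cannot be used, so the maximum size
        # is taken as the exact size, as described in the thread.
        return max_bytes
    wanted = math.ceil(estimated_bits(ndv, fpp) / 8)
    size = 32  # assumed minimum of one 256-bit block
    while size < wanted:
        size *= 2  # assumed rounding up to a power of two
    return min(size, max_bytes)


# Without an ndv, changing the fpp makes no difference.
no_ndv_loose = bloom_filter_bytes(None, 0.1, 1048576)
no_ndv_tight = bloom_filter_bytes(None, 0.001, 1048576)

# With an ndv, the size follows the fpp and stays under the cap.
with_ndv = bloom_filter_bytes(1000, 0.01, 1048576)
```

With the proposed write.parquet.bloom-filter-ndv.column.<col> property set, the second path would apply and the filter could be much smaller than the 1 MB maximum for low-cardinality columns.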
