alamb opened a new issue, #9667: URL: https://github.com/apache/arrow-rs/issues/9667
## Is your feature request related to a problem or challenge? PR #9628 added bloom filter folding and improved the default sizing behavior for bloom filters. During review, we discussed a follow-on API cleanup to make bloom filter configuration more explicit: https://github.com/apache/arrow-rs/pull/9628#discussion_r3022040191 Today, `BloomFilterProperties` is still a public struct with public fields, which makes the API harder to evolve cleanly. The current builder implementation also has implicit internal state around whether NDV was explicitly set, and that state is not surfaced clearly in the API. While reviewing #9628, I also found a related behavioral issue in the current builder flow: disabling and re-enabling bloom filters does not fully reset the hidden "explicit NDV was set" state. That means re-enabling can preserve stale configuration intent instead of returning to the default behavior. The relevant disable path is also not covered by tests, and the per-column implicit-NDV resolution path is currently uncovered as well. ## Describe the solution you'd like Add a `BloomFilterPropertiesBuilder` and make bloom filter configuration more explicit. In particular, this follow-on change should aim to: - introduce a dedicated builder for bloom filter configuration - make the difference between: - default NDV derived from row group sizing - explicitly configured NDV - explicitly configured FPP - enabling/disabling bloom filters clear in the public API - reduce reliance on direct mutation of `BloomFilterProperties` public fields - make it easier to evolve this API in a future major release This API cleanup should also resolve the state-management issue found in the current implementation: - disabling bloom filters should fully reset associated bloom filter configuration state - re-enabling without reapplying explicit settings should use the default resolution path again ## Describe alternatives you've considered Keep the current API and add more documentation around the current semantics. For example, a more explicit API could look something like: ```rust let props = WriterProperties::builder() .set_column_bloom_filter_properties( ColumnPath::from("col"), BloomFilterProperties::builder() .with_fpp(0.01) .with_max_ndv(10_000) .build(), ) .build(); ``` Compared with direct public-field configuration, a builder like this would make it clearer which values are explicitly configured and which are left to defaults. However, keeping the current API would preserve a configuration model that is still fairly implicit, and it would not address the underlying awkwardness of encoding builder-only state around explicit-vs-default NDV behavior. ## Additional context Follow-on from PR #9628: - https://github.com/apache/arrow-rs/pull/9628 Relevant review discussion: - https://github.com/apache/arrow-rs/pull/9628#discussion_r3022040191 Additional follow-up work for this issue should include regression tests for: - disable -> re-enable bloom filter configuration - column-specific bloom filter configuration with implicit NDV resolution - explicit vs default NDV behavior -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
