alamb opened a new issue, #9667:
URL: https://github.com/apache/arrow-rs/issues/9667

   ## Is your feature request related to a problem or challenge?
   
   PR #9628 added bloom filter folding and improved the default sizing behavior 
for bloom filters. During review, we discussed a follow-on API cleanup to make 
bloom filter configuration more explicit:
   
   https://github.com/apache/arrow-rs/pull/9628#discussion_r3022040191
   
   Today, `BloomFilterProperties` is still a public struct with public fields, 
which makes the API harder to evolve cleanly. The current builder 
implementation also has implicit internal state around whether NDV was 
explicitly set, and that state is not surfaced clearly in the API.
   
   While reviewing #9628, I also found a related behavioral issue in the 
current builder flow: disabling and re-enabling bloom filters does not fully 
reset the hidden "explicit NDV was set" state. That means re-enabling can 
preserve stale configuration intent instead of returning to the default 
behavior. The relevant disable path is also not covered by tests, and the 
per-column implicit-NDV resolution path is currently uncovered as well.
   
   ## Describe the solution you'd like
   
   Add a `BloomFilterPropertiesBuilder` and make bloom filter configuration 
more explicit.
   
   In particular, this follow-on change should aim to:
   
   - introduce a dedicated builder for bloom filter configuration
   - make the difference between:
     - default NDV derived from row group sizing
     - explicitly configured NDV
     - explicitly configured FPP
     - enabling/disabling bloom filters
     clear in the public API
   - reduce reliance on direct mutation of `BloomFilterProperties` public fields
   - make it easier to evolve this API in a future major release
   
   This API cleanup should also resolve the state-management issue found in the 
current implementation:
   
   - disabling bloom filters should fully reset associated bloom filter 
configuration state
   - re-enabling without reapplying explicit settings should use the default 
resolution path again
   
   ## Describe alternatives you've considered
   
   Keep the current API and add more documentation around the current semantics.
   
   For example, a more explicit API could look something like:
   
   ```rust
   let props = WriterProperties::builder()
       .set_column_bloom_filter_properties(
           ColumnPath::from("col"),
           BloomFilterProperties::builder()
               .with_fpp(0.01)
               .with_max_ndv(10_000)
               .build(),
       )
       .build();
   ```
   
   Compared with direct public-field configuration, a builder like this would 
make it clearer which values are explicitly configured and which are left to 
defaults.
   
   However, keeping the current API would preserve a configuration model that 
is still fairly implicit, and it would not address the underlying awkwardness 
of encoding builder-only state around explicit-vs-default NDV behavior.
   
   ## Additional context
   
   Follow-on from PR #9628:
   - https://github.com/apache/arrow-rs/pull/9628
   
   Relevant review discussion:
   - https://github.com/apache/arrow-rs/pull/9628#discussion_r3022040191
   
   Additional follow-up work for this issue should include regression tests for:
   - disable -> re-enable bloom filter configuration
   - column-specific bloom filter configuration with implicit NDV resolution
   - explicit vs default NDV behavior
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to