HuaHuaY opened a new pull request, #50008:
URL: https://github.com/apache/arrow/pull/50008

   ### Rationale for this change
   
   This PR follows https://github.com/apache/arrow-rs/pull/9628. It reduces the 
disk usage of Bloom filters, so specifying an `ndv` value larger than the 
actual number of distinct values no longer increases disk usage.
   
   > Bloom filters now support folding mode: allocate a conservatively large 
filter (sized for worst-case NDV), insert all values during writing, then fold 
down at flush time to meet a target FPP. This eliminates the need to guess NDV 
upfront and produces optimally-sized filters automatically.
   
   ### What changes are included in this PR?
   
   `BloomFilterBuilder` will try to fold the bloom filter before writing it to 
the output stream.
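
The folding idea can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `ToyFoldableFilter`, its one-word blocks, the 3-bit mask, and the fill-ratio FPP estimate are all hypothetical simplifications. It assumes a power-of-two block count and a multiplicative block index, so that OR-ing adjacent blocks `2i` and `2i + 1` into block `i` keeps every inserted value findable.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy blocked Bloom filter for illustration only. Each block is a single
// 64-bit word; this is NOT the Parquet split-block layout.
class ToyFoldableFilter {
 public:
  // num_blocks must be a power of two so the filter can be halved repeatedly.
  explicit ToyFoldableFilter(std::size_t num_blocks) : blocks_(num_blocks, 0) {
    assert(num_blocks > 0 && (num_blocks & (num_blocks - 1)) == 0);
  }

  void Insert(uint64_t hash) { blocks_[BlockIndex(hash)] |= Mask(hash); }

  bool MightContain(uint64_t hash) const {
    const uint64_t mask = Mask(hash);
    return (blocks_[BlockIndex(hash)] & mask) == mask;
  }

  // Keep halving while the estimated FPP of the halved filter stays within
  // target_fpp. Halving N blocks to N/2 maps old block b to block b / 2.
  void FoldToTarget(double target_fpp) {
    while (blocks_.size() > 1) {
      std::vector<uint64_t> folded(blocks_.size() / 2);
      for (std::size_t i = 0; i < folded.size(); ++i) {
        folded[i] = blocks_[2 * i] | blocks_[2 * i + 1];
      }
      if (EstimateFpp(folded) > target_fpp) break;
      blocks_ = std::move(folded);
    }
  }

  std::size_t num_blocks() const { return blocks_.size(); }

 private:
  // Multiplicative mapping of the upper 32 hash bits onto [0, num_blocks).
  std::size_t BlockIndex(uint64_t hash) const {
    return static_cast<std::size_t>(((hash >> 32) * blocks_.size()) >> 32);
  }

  // Three bit positions taken from the low hash bits; independent of size.
  static uint64_t Mask(uint64_t hash) {
    return (uint64_t{1} << (hash & 63)) | (uint64_t{1} << ((hash >> 6) & 63)) |
           (uint64_t{1} << ((hash >> 12) & 63));
  }

  // Rough estimate: average over blocks of (fill ratio)^3, because a lookup
  // tests up to three bits inside one block.
  static double EstimateFpp(const std::vector<uint64_t>& blocks) {
    double sum = 0.0;
    for (uint64_t b : blocks) {
      const double fill = static_cast<double>(std::bitset<64>(b).count()) / 64.0;
      sum += fill * fill * fill;
    }
    return sum / static_cast<double>(blocks.size());
  }

  std::vector<uint64_t> blocks_;
};
```

In this sketch a writer would construct the filter with a generous block count, call `Insert` for every value, and call `FoldToTarget` once before serializing. Folding never introduces false negatives, it only trades size for a higher false-positive rate.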
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   Yes. 
   
   `ndv` in `BloomFilterOptions` changes from `int32_t` to 
`std::optional<int64_t>`, and the `ndv` argument of `OptimalNumOfBytes` and 
`OptimalNumOfBits` in `BlockSplitBloomFilter` changes from `uint32_t` to 
`uint64_t`.
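
As a rough illustration of how an optional, 64-bit `ndv` might feed the initial sizing, here is a hedged sketch. Only the `std::optional<int64_t>` and `uint64_t` types come from the description above. The names `SketchBloomFilterOptions`, `SketchOptimalNumOfBytes`, `SketchInitialNumBytes`, the size bounds, the default `fpp`, and the choice to fall back to the maximum size when `ndv` is unset are all assumptions made for this example. The sizing formula is a commonly used estimate for blocked filters that set 8 bits per value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical bounds for this sketch; not taken from the PR.
constexpr uint64_t kSketchMinBytes = 32;
constexpr uint64_t kSketchMaxBytes = 128ULL * 1024 * 1024;

// Hypothetical stand-in for the options struct. Only the type of `ndv`
// follows the PR description; the fpp default is arbitrary.
struct SketchBloomFilterOptions {
  std::optional<int64_t> ndv;
  double fpp = 0.05;
};

// Estimate bits = -8 * ndv / ln(1 - fpp^(1/8)), then clamp to the bounds
// above and round up to a power of two. A 64-bit ndv lets callers pass
// counts above the 32-bit range without wrapping.
uint64_t SketchOptimalNumOfBytes(uint64_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  const double bits = -8.0 * static_cast<double>(ndv) /
                      std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
  const double bytes = std::ceil(bits / 8.0);
  if (bytes >= static_cast<double>(kSketchMaxBytes)) return kSketchMaxBytes;
  uint64_t result = kSketchMinBytes;
  while (static_cast<double>(result) < bytes) result <<= 1;
  return result;
}

// Assumption: with no ndv, start from the largest filter and rely on
// folding at flush time to shrink it.
uint64_t SketchInitialNumBytes(const SketchBloomFilterOptions& opts) {
  if (!opts.ndv.has_value()) return kSketchMaxBytes;
  const int64_t ndv = std::max<int64_t>(*opts.ndv, 0);
  return SketchOptimalNumOfBytes(static_cast<uint64_t>(ndv), opts.fpp);
}
```

With these assumed bounds, an `ndv` of five billion clamps to the maximum size instead of wrapping, which is the kind of input the wider argument type makes representable.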
   


