HuaHuaY opened a new pull request, #50008: URL: https://github.com/apache/arrow/pull/50008
### Rationale for this change

This PR follows https://github.com/apache/arrow-rs/pull/9628. It reduces the on-disk size of Bloom filters by folding them before they are written, so specifying an `ndv` value larger than the actual number of distinct values does not increase disk usage.

> Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.

### What changes are included in this PR?

`BloomFilterBuilder` tries to fold the Bloom filter before writing it to the output stream.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

- The type of `ndv` in `BloomFilterOptions` changes from `int32_t` to `std::optional<int64_t>`.
- The `ndv` argument of `OptimalNumOfBytes` and `OptimalNumOfBits` in `BlockSplitBloomFilter` changes from `uint32_t` to `uint64_t`.
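To illustrate the folding idea from the rationale, here is a minimal sketch. It is not the code in this PR or in arrow-rs. `ToyBloom`, `Mix`, `FoldOnce`, `EstimatedFppAfterFold` and `FoldToTarget` are hypothetical names, each block is a single 64-bit word rather than the 256-bit block of a split-block filter, `Mix` stands in for the real hash function, and the FPP estimate is a rough fill-ratio heuristic. The sketch assumes the block index is chosen by multiply-shift on the upper 32 hash bits. Under that assumption, halving an even block count sends old blocks `2i` and `2i + 1` to new block `i`, so OR-ing adjacent pairs cannot introduce false negatives.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in 64-bit mixer (splitmix64 finalizer); an assumption for this sketch only.
inline std::uint64_t Mix(std::uint64_t x) {
  x += 0x9E3779B97F4A7C15ULL;
  x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
  x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
  return x ^ (x >> 31);
}

// Toy blocked Bloom filter: one 64-bit word per block, two bits set per value.
struct ToyBloom {
  std::vector<std::uint64_t> blocks;

  // Multiply-shift block selection from the upper 32 hash bits.
  static std::size_t BlockIndex(std::uint64_t hash, std::size_t num_blocks) {
    return static_cast<std::size_t>(((hash >> 32) * num_blocks) >> 32);
  }

  // Two bit positions inside the block, derived from the lower hash bits.
  static std::uint64_t Mask(std::uint64_t hash) {
    return (std::uint64_t{1} << (hash & 63)) |
           (std::uint64_t{1} << ((hash >> 6) & 63));
  }

  void Insert(std::uint64_t hash) {
    blocks[BlockIndex(hash, blocks.size())] |= Mask(hash);
  }

  bool MightContain(std::uint64_t hash) const {
    const std::uint64_t m = Mask(hash);
    return (blocks[BlockIndex(hash, blocks.size())] & m) == m;
  }

  // Halve the filter by OR-ing each adjacent pair of blocks into one.
  void FoldOnce() {
    assert(blocks.size() >= 2 && blocks.size() % 2 == 0);
    const std::size_t half = blocks.size() / 2;
    for (std::size_t i = 0; i < half; ++i) {
      blocks[i] = blocks[2 * i] | blocks[2 * i + 1];
    }
    blocks.resize(half);
  }

  // Heuristic FPP of the filter that one more fold would produce:
  // the mean over folded blocks of (fill ratio)^2, since two bits are probed.
  double EstimatedFppAfterFold() const {
    const std::size_t half = blocks.size() / 2;
    double sum = 0.0;
    for (std::size_t i = 0; i < half; ++i) {
      const double fill =
          std::bitset<64>(blocks[2 * i] | blocks[2 * i + 1]).count() / 64.0;
      sum += fill * fill;
    }
    return sum / static_cast<double>(half);
  }

  // Keep folding while the folded filter would still meet the target FPP.
  void FoldToTarget(double target_fpp) {
    while (blocks.size() >= 2 && blocks.size() % 2 == 0 &&
           EstimatedFppAfterFold() <= target_fpp) {
      FoldOnce();
    }
  }
};
```

A caller would allocate the filter for a worst-case NDV, insert every value, and call `FoldToTarget` once just before serializing. Every inserted value still passes `MightContain` afterwards, and the filter is only as large as the target FPP requires.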
