ethe opened a new issue, #9067:
URL: https://github.com/apache/arrow-rs/issues/9067
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
I'm integrating Parquet bloom filters into an async pruning pipeline and
found a gap in the public API.
Current situation:
- There is a sync API:
`Sbbf::read_from_column_chunk(column_meta, reader)`
- There is an async method, but only on the async Arrow builder:
`ParquetRecordBatchStreamBuilder::get_row_group_column_bloom_filter(...)`
- The helper used internally to parse bloom filter headers is `pub(crate)`:
`chunk_read_bloom_filter_header_and_offset` (in `parquet::bloom_filter`)
If a caller has only `ParquetMetaData` and an `AsyncFileReader`,
downstream applications can't read bloom filters without:
1. using the builder (which requires `&mut` and ties you to Arrow's reader), or
2. re‑implementing Parquet bloom header parsing.
This blocks async metadata‑only pruning libraries (like ours) from using
bloom filters safely and efficiently.
**Describe the solution you'd like**
Expose a public async bloom reader that mirrors the sync API:
```rust
pub async fn read_bloom_filter_async<R: AsyncFileReader>(
    column_meta: &ColumnChunkMetaData,
    reader: &mut R,
) -> Result<Option<Sbbf>>;
```
This would:
- keep internal header parsing private
- allow async pruning without coupling to the Arrow builder
- avoid duplicate parsing logic in downstream crates
- be backwards compatible (pure API addition)
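For illustration, here is a rough sketch of how a pruning library might act on the results once each row group's filter has been fetched. It is std-only and leaves out the async I/O entirely. `MembershipFilter`, `ExactSet`, and `row_groups_to_scan` are hypothetical names made up for this example, with `MembershipFilter` standing in for `Sbbf`. It also assumes that `Ok(None)` from the proposed function would mean "this column chunk has no bloom filter", in which case the row group cannot be skipped.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for `Sbbf`: only the membership check matters here.
trait MembershipFilter {
    /// Returns `false` only when `value` is definitely absent.
    fn might_contain(&self, value: &[u8]) -> bool;
}

/// Toy filter used for illustration: an exact set, not a real bloom filter,
/// so it never reports false positives.
struct ExactSet(HashSet<Vec<u8>>);

impl ExactSet {
    fn of(values: &[&str]) -> Self {
        ExactSet(values.iter().map(|v| v.as_bytes().to_vec()).collect())
    }
}

impl MembershipFilter for ExactSet {
    fn might_contain(&self, value: &[u8]) -> bool {
        self.0.contains(value)
    }
}

/// Returns the indices of row groups that still have to be read.
/// `filters[i]` is the (assumed) per-row-group result of the proposed reader:
/// `None` means no bloom filter was available, so the row group is kept.
fn row_groups_to_scan<F: MembershipFilter>(filters: &[Option<F>], value: &[u8]) -> Vec<usize> {
    filters
        .iter()
        .enumerate()
        .filter_map(|(i, filter)| match filter {
            None => Some(i),                               // no filter: cannot prune
            Some(f) if f.might_contain(value) => Some(i),  // possible match: keep
            Some(_) => None,                               // definitely absent: prune
        })
        .collect()
}
```

With filters `[Some({"x"}), None, Some({"y"})]`, a lookup for `"x"` would keep row groups 0 and 1, while a lookup for `"z"` would keep only row group 1, since the row group without a filter can never be pruned.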
*Alternative*
Make `chunk_read_bloom_filter_header_and_offset` public, but this is a
low‑level parsing helper and would bake in more implementation detail.