[I] API for filtering / evaluation directly on encoded data [arrow-rs]

via GitHub Fri, 14 Nov 2025 06:08:37 -0800


alamb opened a new issue, #8842:
URL: https://github.com/apache/arrow-rs/issues/8842


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   The idea of [executing directly on "compressed" (encoded) 
data](https://www.cs.umd.edu/~abadi/papers/abadisigmod06.pdf) is a well known 
technique in the academic columnar store literature. The Parquet format 
currently supports several different 
[encodings](https://github.com/apache/parquet-format/blob/master/Encodings.md) 
such as:
   - Plain
   - Dictionary (and hybrid)
   - RLE/Bitpacked
   - DeltaBinaryPacked
   - DeltaByteArray
   - ByteStreamSplit
   
   The current Rust Parquet reader supports evaluating predicates during the 
scan via the  
[ArrowReaderBuilder::with_row_filter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter).
 However, the current 
[RowFilter](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html)
 API only permits evaluation on Arrow arrays, aka **after** decompression and 
decoding. 
   
   It would be interesting to consider supporting evaluating predicates 
directly on parquet encoded data, to avoid decode costs when unnecessary -- for 
example when doing "needle in the haystack" type queries that filter out almost 
all rows
   
   As the community considers [adding more encodings], such as ALP and FSST 
that are even more amenable to encoded operation, the benefit of such an API 
will increase. 
   
   
   **Describe the solution you'd like**
   I would like some way to evaluate predicates directly on the encoded Parquet 
data
   
   **Describe alternatives you've considered**
   
   One potential thing we could do is extend the existing RowFilter API so 
different implementations could be provided for different encodings
   ```rust
   // add predicates evaluated on (decoded) Arrow arrays
   let filter = RowFilter::new(arrow_predicates)
     // add specializations that can evaluate directly on RLE encoded data
     // if the data is not RLE encoded, fall back to the arrow ones
     .with_rle_predicates(...)
   ```
   
   We would have to do some more experiments to know what the API for RLE (and 
similar) predicates should be
     
   
   **Additional context**
   
   The Vortex file implementation seems to be heading towards adding their own 
Expression implementation for common expressions and then providing the 
   
   For example, see how VortexExpr::evaluate looks (it has all the knowledge of 
the types, etc)
   
https://docs.rs/vortex-expr/0.54.0/vortex_expr/trait.VortexExpr.html#method.evaluate


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] API for filtering / evaluation directly on encoded data [arrow-rs]

Reply via email to