Re: [PR] POC: Sketch out cached filter result API [arrow-rs]

via GitHub Wed, 28 May 2025 07:35:33 -0700


alamb commented on PR #7513:
URL: https://github.com/apache/arrow-rs/pull/7513#issuecomment-2916587149


   Status update
   
   # Current State of this PR
   1. Caches the results of the most recent filter which is applied during 
parquet decode 
   2. Contains an initial implementation of `ArrayBuilderExtFilter` and 
`ArrayBuilderExtConcat`, which permit incrementally building arrays without 
materializing the intermediate results (prototype API from 
https://github.com/apache/arrow-rs/issues/6692)
   3. Contains `IncrementalRecordBatchBuilder` that incrementally builds record 
batches from filtered results. 
   
   The use of the incremental builders saves at least one memory copy during 
filtering and reduces the buffering required (which also might increase speed). 
It will also reduce the number of times we have to rewrite StringView, which 
will help.
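   
   To illustrate where the saved copy comes from, here is a minimal sketch 
using only the Rust standard library. It is not the API in this PR: `Batch`, 
`filter_then_concat`, and `IncrementalBuilder` are hypothetical names, and plain 
`Vec<i64>` stands in for Arrow arrays. The baseline copies each selected value 
twice (once into a filtered intermediate, once in concat), while the incremental 
version appends selected values straight into the output being built.

```rust
/// Hypothetical stand-in for an Arrow array: a plain vector of values.
type Batch = Vec<i64>;

/// Baseline: filter each batch into its own intermediate Vec, then concatenate.
/// Every selected value is copied twice (by the filter, then by the concat).
fn filter_then_concat(batches: &[Batch], masks: &[Vec<bool>]) -> Vec<i64> {
    let filtered: Vec<Vec<i64>> = batches
        .iter()
        .zip(masks)
        .map(|(batch, mask)| {
            batch
                .iter()
                .zip(mask)
                .filter(|(_, keep)| **keep)
                .map(|(value, _)| *value)
                .collect()
        })
        .collect();
    filtered.concat()
}

/// Hypothetical incremental builder (assumption, not the PR's type): selected
/// values are appended directly to the output under construction, so no
/// per-batch filtered intermediate is materialized.
struct IncrementalBuilder {
    target_len: usize,
    current: Vec<i64>,
    completed: Vec<Vec<i64>>,
}

impl IncrementalBuilder {
    fn new(target_len: usize) -> Self {
        assert!(target_len > 0, "target_len must be non-zero");
        Self {
            target_len,
            current: Vec::with_capacity(target_len),
            completed: Vec::new(),
        }
    }

    /// Append the rows of `batch` whose mask entry is true.
    fn append_filtered(&mut self, batch: &[i64], mask: &[bool]) {
        assert_eq!(batch.len(), mask.len());
        for (value, keep) in batch.iter().zip(mask) {
            if *keep {
                self.current.push(*value);
                // Emit an output batch as soon as it reaches the target size.
                if self.current.len() == self.target_len {
                    let full = std::mem::replace(
                        &mut self.current,
                        Vec::with_capacity(self.target_len),
                    );
                    self.completed.push(full);
                }
            }
        }
    }

    /// Flush any partially filled batch and return all output batches.
    fn finish(mut self) -> Vec<Vec<i64>> {
        if !self.current.is_empty() {
            self.completed.push(self.current);
        }
        self.completed
    }
}
```

   Calling `append_filtered` once per decoded batch and `finish` at the end 
yields output batches of at most `target_len` rows, which is also why less 
buffering is needed than when filtered batches are held for a later concat.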
   
   I next plan to:
   1. Run the arrow-rs benchmarks to show whether it helps
   2. Do a POC in DataFusion using the IncrementalRecordBatchBuilder in 
FilterExec to see if it makes a difference there
   
   If those tests look good, I will begin breaking this PR up into smaller 
pieces for review
   
   Major items I know are needed:
   1. Memory limiting for cached results in the parquet reader
   2. Updating previous cached results with subsequent filters
   3. Benchmarks showing the effect of using incremental filtering / append 
compared to filter and concat
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
