zhuqi-lucas commented on PR #7513: URL: https://github.com/apache/arrow-rs/pull/7513#issuecomment-2916628725
> Status update
>
> # Current State of this PR
>
> 1. Caches the results of the most recent filter applied during parquet decode
> 2. Contains an initial implementation of `ArrayBuilderExtFilter` and `ArrayBuilderExtConcat`, which permit incrementally building arrays without materializing the intermediate results (prototype API from [Optimize take/filter/concat from multiple input arrays to a single large output array #6692](https://github.com/apache/arrow-rs/issues/6692))
> 3. Contains `IncrementalRecordBatchBuilder`, which incrementally builds record batches from filtered results
>
> Using the incremental builders saves at least one memory copy during filtering and reduces the buffering required (which may also improve speed). It will also reduce how often we have to rewrite StringView, which will help.
>
> # Next Steps
>
> I next plan to:
>
> 1. Run arrow-rs benchmarks to show it helping
> 2. Do a POC in DataFusion using the `IncrementalRecordBatchBuilder` in `FilterExec` to see if it makes a difference there
>
> If those tests look good, I will begin breaking this PR up into smaller pieces for review.
>
> Major items I know are needed:
>
> 1. Memory limiting for cached results in the parquet reader
> 2. Updating previously cached results with subsequent filters
> 3. Benchmarks showing the effect of incremental filtering / append compared to filter-and-concat

Great work, thank you @alamb! I will study and review the code details tomorrow!
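For context, here is a minimal sketch in Rust, using only the public `arrow` crate, of the pattern the PR optimizes. The classic path filters each batch into an intermediate array and then concatenates them all; the incremental path appends the selected values directly into one builder. `incremental_filter` below is an illustrative stand-in for the prototype `ArrayBuilderExtFilter` / `IncrementalRecordBatchBuilder` APIs, which live in the PR branch and are not shown here.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BooleanArray, Int32Array, Int32Builder};
use arrow::compute::{concat, filter};

/// Classic path: each `filter` call materializes an intermediate array,
/// and `concat` then copies everything again into the final output.
fn filter_then_concat(batches: &[(Int32Array, BooleanArray)]) -> ArrayRef {
    let filtered: Vec<ArrayRef> = batches
        .iter()
        .map(|(values, predicate)| filter(values, predicate).unwrap())
        .collect();
    let refs: Vec<&dyn Array> = filtered.iter().map(|a| a.as_ref()).collect();
    concat(&refs).unwrap()
}

/// Incremental path: selected values are appended straight into a single
/// builder, skipping both the intermediate arrays and the final concat copy.
/// (Illustrative stand-in for the prototype `ArrayBuilderExtFilter` API.)
fn incremental_filter(batches: &[(Int32Array, BooleanArray)]) -> ArrayRef {
    let mut builder = Int32Builder::new();
    for (values, predicate) in batches {
        for (value, keep) in values.iter().zip(predicate.iter()) {
            if keep == Some(true) {
                builder.append_option(value);
            }
        }
    }
    Arc::new(builder.finish())
}
```

The "saves at least one memory copy" claim in the status update corresponds to eliminating the `concat` step in the first function: the incremental path writes each selected value exactly once.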