zhuqi-lucas commented on PR #7513: URL: https://github.com/apache/arrow-rs/pull/7513#issuecomment-2916628725
> Status update
>
> # Current State of this PR
>
> 1. Caches the results of the most recent filter applied during parquet decode
> 2. Contains an initial implementation of `ArrayBuilderExtFilter` and `ArrayBuilderExtConcat`, which permit incrementally building arrays without materializing the intermediate results (prototype API from [Optimize take/filter/concat from multiple input arrays to a single large output array #6692](https://github.com/apache/arrow-rs/issues/6692))
> 3. Contains `IncrementalRecordBatchBuilder`, which incrementally builds record batches from filtered results
>
> Using the incremental builders saves at least one memory copy during filtering and reduces the buffering required (which may also improve speed). It will also reduce how often we have to rewrite StringView, which will help.
>
> # Next Steps
>
> I next plan to:
>
> 1. Run arrow-rs benchmarks to show it helping
> 2. Do a POC in DataFusion using the `IncrementalRecordBatchBuilder` in `FilterExec` to see if it makes a difference there
>
> If those tests look good, I will begin breaking this PR up into smaller pieces for review.
>
> Major items I know are needed:
>
> 1. Memory limiting for cached results in the parquet reader
> 2. Updating previously cached results with subsequent filters
> 3. Benchmarks showing the effect of incremental filtering / append compared to filter-and-concat

Great work, thank you @alamb! I will study and review the code details tomorrow!
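For context, here is a minimal sketch in Rust, using only the public `arrow` crate, of the pattern the PR optimizes. The classic path filters each batch into an intermediate array and then concatenates them all; the incremental path appends the selected values directly into one builder. `incremental_filter` below is an illustrative stand-in for the prototype `ArrayBuilderExtFilter` / `IncrementalRecordBatchBuilder` APIs, which live in the PR branch and are not shown here.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BooleanArray, Int32Array, Int32Builder};
use arrow::compute::{concat, filter};

/// Classic path: each `filter` call materializes an intermediate array,
/// and `concat` then copies everything again into the final output.
fn filter_then_concat(batches: &[(Int32Array, BooleanArray)]) -> ArrayRef {
    let filtered: Vec<ArrayRef> = batches
        .iter()
        .map(|(values, predicate)| filter(values, predicate).unwrap())
        .collect();
    let refs: Vec<&dyn Array> = filtered.iter().map(|a| a.as_ref()).collect();
    concat(&refs).unwrap()
}

/// Incremental path: selected values are appended straight into a single
/// builder, skipping both the intermediate arrays and the final concat copy.
/// (Illustrative stand-in for the prototype `ArrayBuilderExtFilter` API.)
fn incremental_filter(batches: &[(Int32Array, BooleanArray)]) -> ArrayRef {
    let mut builder = Int32Builder::new();
    for (values, predicate) in batches {
        for (value, keep) in values.iter().zip(predicate.iter()) {
            if keep == Some(true) {
                builder.append_option(value);
            }
        }
    }
    Arc::new(builder.finish())
}
```

The "saves at least one memory copy" claim in the status update corresponds to eliminating the `concat` step in the first function: the incremental path writes each selected value exactly once.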