zhuqi-lucas commented on PR #7454: URL: https://github.com/apache/arrow-rs/pull/7454#issuecomment-2872665713
> First of all, thank you so much @zhuqi-lucas
>
> I really like your code in `FilteredParquetRecordBatchReader` -- the idea of combining the application of the `RowFilter` and the decoding of the projection into a single reader I think is a key insight and maybe points the way towards not decoding twice
>
> After reviewing this code, it seems to me that a lot of work is done to use the same `RowSelection` structure for both
>
> 1. Skipping large contiguous chunks of rows (e.g. row groups and entire pages)
> 2. Applying a `RowFilter` for filtering individual rows
>
> I think `RowSelection` is well designed for the former, but quite bad for the latter (applying `RowFilter`)
>
> As this PR starts down the path of separating the two concerns, I wonder if you have thought about pushing it even further? Something like keeping the results of the `RowFilter` only as `BooleanArrays` and then progressively decoding the remaining projections?

Thank you @alamb for the review. I agree we can go further in this PR, and I will try to do it.

The key improvement that reduces the regression is this: when the average selection length is below 10 (the `total < 10 * select_count` check below), we fall back to reading all the rows and then filtering them, which is faster because it vectorizes better.

```rust
if total < 10 * select_count {
    // Bitmap branch: selections are short on average, so read every row
    // and record which ones to keep in a mask instead of skipping.
    let bitmap = self.create_bitmap_from_ranges(&runs);
    match self.array_reader.read_records(bitmap.len()) {
        Ok(_) => {}
        Err(e) => return Some(Err(e.into())),
    };
    mask_builder.append_buffer(bitmap.values());
    rows_accum += bitmap.true_count();
}
```

I agree that `create_bitmap_from_ranges` has some overhead. If we can return the bitmap directly from the predicate filter, we will get better performance. I will try to make this improvement as well.
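
The fallback can be illustrated outside the reader with a minimal standalone sketch. It does not use any arrow-rs types: `Run`, `prefer_bitmap`, `bitmap_from_runs` and `filter_rows` are hypothetical names, and it assumes that `total` is the number of rows covered by all runs and that `select_count` is the number of selected runs, which may differ from how the PR computes them.

```rust
/// Hypothetical stand-in for one run of a row selection:
/// `len` consecutive rows that are either all selected or all skipped.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Run {
    len: usize,
    select: bool,
}

/// Returns true when selected runs are short on average, mirroring the
/// `total < 10 * select_count` check (assumed meaning of both variables).
fn prefer_bitmap(runs: &[Run]) -> bool {
    let total: usize = runs.iter().map(|r| r.len).sum();
    let select_count = runs.iter().filter(|r| r.select).count();
    select_count > 0 && total < 10 * select_count
}

/// Expands the runs into one boolean per row (true = keep the row).
fn bitmap_from_runs(runs: &[Run]) -> Vec<bool> {
    let total: usize = runs.iter().map(|r| r.len).sum();
    let mut mask = Vec::with_capacity(total);
    for r in runs {
        mask.extend(std::iter::repeat(r.select).take(r.len));
    }
    mask
}

/// "Read everything, then filter": keeps the rows whose mask bit is set.
fn filter_rows<T: Copy>(rows: &[T], mask: &[bool]) -> Vec<T> {
    rows.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

For example, runs of 3 selected, 2 skipped and 4 selected rows give `total = 9` and `select_count = 2`, so `prefer_bitmap` returns true and the nine rows would be decoded in one pass and then masked. A single selected run of 1000 rows followed by 5000 skipped rows would not take this path, since skipping the large contiguous chunk is cheaper there.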