zhuqi-lucas commented on PR #7454:
URL: https://github.com/apache/arrow-rs/pull/7454#issuecomment-2872665713

   > First of all, thank you so much @zhuqi-lucas
   > 
   > I really like your code in `FilteredParquetRecordBatchReader` -- the idea of combining the application of the `RowFilter` and the decoding of the projection into a single reader I think is a key insight and maybe points the way towards not decoding twice
   > 
   > After reviewing this code, it seems to me that a lot of work is done to use the same `RowSelection` structure for both
   > 
   > 1. Skipping large contiguous chunks of rows (e.g. row groups and entire pages)
   > 2. Applying a `RowFilter` for filtering individual rows
   > 
   > I think `RowSelection` is well designed for the former, but quite bad for the latter (applying `RowFilter`)
   > 
   > As this PR starts down the path of separating the two concerns, I wonder if you have thought about pushing it even farther? Something like keeping the results of the `RowFilter` only as `BooleanArrays` and then progressively decoding the remaining projections?
   
   Thank you @alamb for the review. I agree we can go further in this PR, and I will try to do so.
   
   
   The key improvement that reduces the regression: when the average number of rows per selection is below 10 (`total < 10 * select_count`), we fall back to reading all the rows and filtering them afterwards, which is faster because it vectorizes better.
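   ```rust
   if total < 10 * select_count {
       // Bitmap branch: selections are short on average, so decode every
       // row and filter with a mask afterwards.
       let bitmap = self.create_bitmap_from_ranges(&runs);
       // Decode all rows covered by the bitmap; propagate any read error.
       if let Err(e) = self.array_reader.read_records(bitmap.len()) {
           return Some(Err(e.into()));
       }
       // Record the mask and count the rows that survive the filter.
       mask_builder.append_buffer(bitmap.values());
       rows_accum += bitmap.true_count();
   }
   ```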
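
To make the heuristic concrete, here is a minimal standalone sketch using only the standard library. It is not the PR's code: `Run`, `prefer_bitmap`, `bitmap_from_runs`, and `filter_values` are hypothetical names, `Run` only loosely mirrors a run-length selector, and it assumes `total` is the number of rows covered by the runs and `select_count` is the number of runs.

```rust
/// Hypothetical run-length selector: `row_count` rows that are either
/// skipped or selected. Not the type used in the PR.
#[derive(Debug, Clone, Copy)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Assumed threshold, taken from the `10 *` factor in the snippet above.
const AVG_RUN_THRESHOLD: usize = 10;

/// Returns true when the average run covers fewer than
/// `AVG_RUN_THRESHOLD` rows, i.e. `total < 10 * select_count`.
fn prefer_bitmap(runs: &[Run]) -> bool {
    let total: usize = runs.iter().map(|r| r.row_count).sum();
    let select_count = runs.len();
    total < AVG_RUN_THRESHOLD * select_count
}

/// Expands the runs into one boolean per row (true = keep the row).
fn bitmap_from_runs(runs: &[Run]) -> Vec<bool> {
    let total: usize = runs.iter().map(|r| r.row_count).sum();
    let mut mask = Vec::with_capacity(total);
    for run in runs {
        mask.extend(std::iter::repeat(!run.skip).take(run.row_count));
    }
    mask
}

/// Filters fully decoded values with the mask, standing in for the
/// "read all rows, then filter" path.
fn filter_values<T: Copy>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}

fn main() {
    // Three short runs over six rows: average run length is 2.
    let runs = [
        Run { row_count: 2, skip: false },
        Run { row_count: 3, skip: true },
        Run { row_count: 1, skip: false },
    ];
    assert!(prefer_bitmap(&runs));

    let mask = bitmap_from_runs(&runs);
    let decoded = [10, 11, 12, 13, 14, 15];
    assert_eq!(filter_values(&decoded, &mask), vec![10, 11, 15]);
}
```

With long runs (for example one 500-row skip followed by one 500-row select), `prefer_bitmap` returns false and the run-based skipping path would be kept instead.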
   
   
   
   I agree that `create_bitmap_from_ranges` has some overhead. If the predicate filter can return the bitmap directly, performance should be better. I will try to make this improvement as well.
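
The overhead in question can be illustrated with a small standard-library sketch. It assumes, hypothetically, that the predicate already produces one boolean per row; `evaluate_predicate`, `mask_to_runs`, and `runs_to_mask` are illustrative names, not functions from the PR or from the parquet crate. Converting the mask to runs and back reproduces the same mask, so returning the mask directly would skip both conversions.

```rust
/// Hypothetical predicate evaluation: yields one boolean per row.
fn evaluate_predicate(values: &[i64], pred: impl Fn(i64) -> bool) -> Vec<bool> {
    values.iter().map(|&v| pred(v)).collect()
}

/// Run-length encodes a mask as (selected, row_count) pairs.
fn mask_to_runs(mask: &[bool]) -> Vec<(bool, usize)> {
    let mut runs: Vec<(bool, usize)> = Vec::new();
    for &selected in mask {
        match runs.last_mut() {
            // Extend the current run when the value repeats.
            Some((last, count)) if *last == selected => *count += 1,
            // Otherwise start a new run.
            _ => runs.push((selected, 1)),
        }
    }
    runs
}

/// Expands runs back into one boolean per row.
fn runs_to_mask(runs: &[(bool, usize)]) -> Vec<bool> {
    runs.iter()
        .flat_map(|&(selected, count)| std::iter::repeat(selected).take(count))
        .collect()
}

fn main() {
    let values = [5, 12, 7, 30, 31, 2];

    // Direct path: use the predicate's mask as is.
    let direct = evaluate_predicate(&values, |v| v > 10);

    // Round trip: mask -> runs -> mask, the extra work a direct
    // bitmap would avoid.
    let round_trip = runs_to_mask(&mask_to_runs(&direct));

    assert_eq!(direct, round_trip);
}
```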


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
