alamb commented on issue #7456:
URL: https://github.com/apache/arrow-rs/issues/7456#issuecomment-2872646158

   @zhuqi-lucas has a great insight in 
https://github.com/apache/arrow-rs/pull/7454 -- namely that instead of a 
two-pass algorithm (evaluate `RowFilter` to form a final `RowSelection` and then 
re-decode the filter) we can combine the filter application and decode steps 
(see https://github.com/apache/arrow-rs/pull/7454#pullrequestreview-2833094545)
   
   
   The current flow goes something like:
   1. A set of array readers is created for the filter columns, using the 
provided `RowSelection` (this captures page pruning).
   2. The decoded batches are used to evaluate the RowFilter / ArrowPredicates, 
which produces a `BooleanArray` bitmap
   3. The "final" `RowSelection` is created by combining the existing 
`RowSelection` with the `BooleanArray`s
   4. A new set of array readers is created with the updated `RowSelection`
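
To make step 3 of the current flow concrete, here is a minimal sketch of how filter results can narrow an existing selection. It does not use the real parquet crate types: `Run` and `refine` are hypothetical, simplified stand-ins for a run-length `RowSelection` and the combining step, and `passed` plays the role of the concatenated `BooleanArray`s (one entry per row that the input selection selects).

```rust
/// Hypothetical stand-in for one run of a run-length row selection.
#[derive(Debug, Clone, PartialEq)]
struct Run {
    len: usize,
    skip: bool,
}

/// Narrow `selection` using `passed`, which holds one filter result for
/// each row that `selection` selects, in order.
fn refine(selection: &[Run], passed: &[bool]) -> Vec<Run> {
    let mut out: Vec<Run> = Vec::new();
    let mut results = passed.iter().copied();
    for run in selection {
        for _ in 0..run.len {
            // Skipped rows were never decoded, so they consume no filter
            // result. Selected rows stay selected only if the filter passed.
            let skip = run.skip || !results.next().unwrap_or(false);
            match out.last_mut() {
                // Extend the current run when the decision is unchanged.
                Some(last) if last.skip == skip => last.len += 1,
                _ => out.push(Run { len: 1, skip }),
            }
        }
    }
    out
}
```

The refined selection is what the second set of array readers (step 4) would be driven by, which is why the filter columns end up being decoded twice in this flow.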
   
   The current PR starts heading down a slightly modified flow, where the 
RowSelection and RowFilters are not combined.
   
   I think a combined solution would look something like: 
   
   1. Create Decoders for filter columns and projection (only) columns
   
   Decoding proceeds like:
   1. Read rows from the initial `RowSelection` (8192 rows at a time) from the 
filter columns, if any
   2. Apply any RowFilters to them (producing a `BooleanArray`)
   3. Repeat 1-2 until at least 8192 (batch size) rows pass the filter. (This 
means we have a `Vec<BooleanArray>` with 8192 1s and a `Vec<Array>` for each 
filter column that is also a projection column)
   4. Then decode as many RecordBatches from the projection (only) columns, 
using the initial `RowSelection`
   5. Apply the filters to each array to form the final output batch (in 
projection columns)
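
The decoding loop above could be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the reader's actual implementation: columns are plain in-memory slices, `selected` is a hypothetical per-row form of the initial `RowSelection`, `predicate` stands in for a `RowFilter`, and the name `filtered_batches` is made up for this example. An output batch here may hold more than `batch_size` rows; a real reader would split or carry over the excess.

```rust
/// One output batch: projection-column values of rows that passed the filter.
type Batch = Vec<i64>;

fn filtered_batches(
    filter_col: &[i64],
    projection_col: &[i64],
    selected: &[bool],
    predicate: impl Fn(i64) -> bool,
    batch_size: usize,
) -> Vec<Batch> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    // Rows that survive the initial selection (e.g. after page pruning).
    let rows: Vec<usize> = (0..filter_col.len()).filter(|&i| selected[i]).collect();

    let mut output = Vec::new();
    let mut pending_rows: Vec<usize> = Vec::new();
    let mut pending_mask: Vec<bool> = Vec::new();
    let mut chunks = rows.chunks(batch_size).peekable();

    while let Some(chunk) = chunks.next() {
        // Steps 1-2: "decode" the filter column for this chunk and evaluate
        // the predicate, remembering the mask instead of a new selection.
        pending_mask.extend(chunk.iter().map(|&i| predicate(filter_col[i])));
        pending_rows.extend_from_slice(chunk);

        // Step 3: keep reading until enough rows pass or input runs out.
        let passing = pending_mask.iter().filter(|&&keep| keep).count();
        if passing < batch_size && chunks.peek().is_some() {
            continue;
        }

        // Steps 4-5: "decode" the projection column for the same rows and
        // apply the accumulated masks to form the output batch.
        let batch: Batch = pending_rows
            .iter()
            .zip(&pending_mask)
            .filter(|&(_, &keep)| keep)
            .map(|(&i, _)| projection_col[i])
            .collect();
        if !batch.is_empty() {
            output.push(batch);
        }
        pending_rows.clear();
        pending_mask.clear();
    }
    output
}
```

The point of the sketch is that the filter column is visited once per chunk and its mask is reused directly when producing output, rather than being folded into a second `RowSelection` that drives a fresh decode.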

