[GitHub] [arrow-rs] thinkharderdev commented on issue #2270: Changes to ParquetRecordBatchStream to support row filtering in DataFusion

GitBox Mon, 01 Aug 2022 15:27:09 -0700


thinkharderdev commented on issue #2270:
URL: https://github.com/apache/arrow-rs/issues/2270#issuecomment-1201794786


   So basically the implementation in DataFusion is:
   
   1. Take the filter predicate and all the projected columns it needs for 
evaluation, and wrap a `ParquetRecordBatchStream` over those columns in a 
stream that polls it, applies the filter, and emits each filtered batch along 
with the corresponding row selection. 
   2. Take all the other columns and create a second 
`ParquetRecordBatchStream` for them. 
   3. Wrap those two streams in another stream which polls the filter stream, 
then polls the second stream with the resulting row selection. 
   4. Graft those batches onto each other to produce the output batch. 
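
As a rough illustration of the four steps above, here is a minimal sketch that uses plain slices and vectors in place of Arrow arrays and async streams. Everything in it (`Selector`, `filter_batch`, `apply_selection`, `filtered_scan`) is a hypothetical stand-in invented for this example, not the arrow-rs or DataFusion API, and it assumes a single `i64` filter column and `batch_size > 0`.

```rust
/// Hypothetical run-length row selection entry: when `skip` is true the next
/// `row_count` rows are skipped, otherwise they are read.
#[derive(Debug, Clone, PartialEq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Step 1: evaluate the predicate on one batch of the filter column and
/// return the surviving values together with the row selection for that batch.
fn filter_batch(batch: &[i64], predicate: impl Fn(i64) -> bool) -> (Vec<i64>, Vec<Selector>) {
    let mut kept = Vec::new();
    let mut selection: Vec<Selector> = Vec::new();
    for &value in batch {
        let skip = !predicate(value);
        if !skip {
            kept.push(value);
        }
        // Extend the current run if it has the same kind, else start a new one.
        match selection.last_mut() {
            Some(last) if last.skip == skip => last.row_count += 1,
            _ => selection.push(Selector { row_count: 1, skip }),
        }
    }
    (kept, selection)
}

/// Step 3: apply a row selection to the matching batch of a non-filter column.
fn apply_selection<T: Clone>(batch: &[T], selection: &[Selector]) -> Vec<T> {
    let mut out = Vec::new();
    let mut offset = 0;
    for s in selection {
        if !s.skip {
            out.extend_from_slice(&batch[offset..offset + s.row_count]);
        }
        offset += s.row_count;
    }
    out
}

/// Steps 2-4: walk both columns one batch at a time, filter the first,
/// select from the second, and graft the results into output rows.
/// Assumes both columns have the same length and `batch_size > 0`.
fn filtered_scan<T: Clone>(
    filter_col: &[i64],
    other_col: &[T],
    batch_size: usize,
    predicate: impl Fn(i64) -> bool,
) -> Vec<(i64, T)> {
    let mut out = Vec::new();
    for (f_batch, o_batch) in filter_col.chunks(batch_size).zip(other_col.chunks(batch_size)) {
        let (kept, selection) = filter_batch(f_batch, &predicate);
        let others = apply_selection(o_batch, &selection);
        out.extend(kept.into_iter().zip(others));
    }
    out
}
```

Calling `filtered_scan` with a batch size of 4 over six rows processes two batches, and only one batch of each column is held at any time, which is the property the batch-at-a-time approach is after.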
   
   Using your example, you would need to decode and filter the entire row 
group for your filter columns, then apply the predicate to generate the row 
selection for the remaining columns. So you'd have to buffer the entire row 
group in memory for those filter columns. That's not too terrible, but it 
seems like it would add overhead and make the decision of whether to try to 
apply a row-level filter higher stakes. I thought that ideally we would be 
able to do the filtering one batch at a time. That should add very little 
overhead (aside from computing the filter expression, which has to be done at 
some point anyway), so we could always apply a row filter if one is available. 


