thinkharderdev commented on issue #2270: URL: https://github.com/apache/arrow-rs/issues/2270#issuecomment-1201794786
So basically the implementation in DataFusion is: 1. Take the filter predicate and all projections it needs to evaluate and wrap a `ParquetRecordBatchStream` in a stream that will poll, apply the filter and emit the filtered batch along with corresponding row selections. 2. Take all the other columns and create a `ParquetRecordBatchStream` for them 3. Wrap those two streams in another stream which pools the filter stream, then polls the selection stream with the row selection. 4. Graft those batches onto each other to produce the output batch. Using your example, you would need to decode/filter the entire row group for your filter columns then apply the predicate to generate the row selection for the remaining columns. So you'd have to buffer the entire row group in memory for those columns. That's not too terrible but it seems like it would add overhead and make the decision of whether to try and apply a row-level filter higher stakes. I though that ideally we would be able to do the filtering one batch at a time. That should be very little overhead (aside from computing the filter expression which has to be done at some point anyway) so we can always apply a row filter if one is available. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
