tustvold commented on issue #1191:
URL: https://github.com/apache/arrow-rs/issues/1191#issuecomment-1023646847


   Definitely. Irrespective of how the pruning takes place, whether via a scan 
mask as proposed here, information derived from a page index, or some other 
mechanism, there's no point doing IO for pages that aren't going to be used. 
:+1:
   
   My hope with the scan masks is to provide an API that lets query engines 
express which rows they want, without having to worry about the mechanics: 
which pages to fetch and decode for those rows, what statistics are 
available, how the writer interpreted ambiguous parts of the parquet 
specification, and so on.
   
   This would leave the implementation in arrow-rs free to choose the most 
efficient way to get the requested rows of data. As an added benefit, this 
approach would integrate well with the async plumbing I stubbed out in #1154, 
which needs to know ahead of time what data is going to be needed.
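
   To make the idea concrete, here is a minimal sketch of how a row-level 
mask could be translated into the set of pages worth fetching. It is not the 
arrow-rs API: `ScanMask`, `pages_to_fetch`, and the `page_first_rows` layout 
are hypothetical names invented for illustration, and the sketch assumes the 
first row index of every page is already known.

```rust
use std::ops::Range;

/// Hypothetical scan mask (not an arrow-rs type): sorted, non-overlapping
/// ranges of row indices that the query engine wants back.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanMask {
    ranges: Vec<Range<usize>>,
}

impl ScanMask {
    /// Builds a mask from arbitrary ranges: drops empty ranges, sorts by
    /// start, and merges ranges that overlap or touch.
    pub fn new(mut ranges: Vec<Range<usize>>) -> Self {
        ranges.retain(|r| r.start < r.end);
        ranges.sort_by_key(|r| r.start);
        let mut merged: Vec<Range<usize>> = Vec::with_capacity(ranges.len());
        for r in ranges {
            if let Some(last) = merged.last_mut() {
                if r.start <= last.end {
                    last.end = last.end.max(r.end);
                    continue;
                }
            }
            merged.push(r);
        }
        Self { ranges: merged }
    }

    /// The merged row ranges, in ascending order.
    pub fn ranges(&self) -> &[Range<usize>] {
        &self.ranges
    }
}

/// Returns the indices of pages containing at least one selected row.
/// Assumption: `page_first_rows[i]` is the first row index of page `i`,
/// in ascending order, and `num_rows` is the total row count.
pub fn pages_to_fetch(
    mask: &ScanMask,
    page_first_rows: &[usize],
    num_rows: usize,
) -> Vec<usize> {
    let mut pages = Vec::new();
    for (i, &start) in page_first_rows.iter().enumerate() {
        // A page ends where the next one begins, or at the end of the data.
        let end = page_first_rows.get(i + 1).copied().unwrap_or(num_rows);
        let wanted = mask
            .ranges()
            .iter()
            .any(|r| r.start < end && start < r.end);
        if wanted {
            pages.push(i);
        }
    }
    pages
}
```

   The caller only states rows; which pages that implies (and therefore what 
IO to issue up front) stays an internal decision of the reader.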
   
   This scan mask approach is heavily geared to the IOx use case, where the 
page index is unlikely to be very helpful, but the two are definitely 
complementary approaches. I'd imagine it is possible to construct the partial 
scan logic in such a way that it works well for both. Ultimately, the page 
index is just a way to generate a low-granularity scan mask :smile: 
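
   As a rough sketch of that last point: if per-page keep/skip decisions have 
already been derived from page index statistics, they collapse into 
page-aligned row ranges, i.e. a coarse scan mask. The function name, its 
parameters, and the `keep` slice are assumptions made for illustration, not 
part of arrow-rs or the parquet format.

```rust
use std::ops::Range;

/// Turns per-page keep/skip decisions into page-aligned row ranges.
/// Assumptions: `page_first_rows[i]` is the first row index of page `i`
/// (ascending), `num_rows` is the total row count, and `keep[i]` says
/// whether page `i` may contain matching rows.
pub fn mask_from_page_decisions(
    page_first_rows: &[usize],
    num_rows: usize,
    keep: &[bool],
) -> Vec<Range<usize>> {
    let mut ranges: Vec<Range<usize>> = Vec::new();
    for (i, (&start, &keep_page)) in page_first_rows.iter().zip(keep).enumerate() {
        if !keep_page {
            continue;
        }
        // A page ends where the next one begins, or at the end of the data.
        let end = page_first_rows.get(i + 1).copied().unwrap_or(num_rows);
        // Extend the previous range when this page is directly adjacent to it.
        if let Some(last) = ranges.last_mut() {
            if last.end == start {
                last.end = end;
                continue;
            }
        }
        ranges.push(start..end);
    }
    ranges
}
```

   The resulting ranges can never be finer than a page, which is the sense in 
which a page index yields a low-granularity mask; a row-level mask supplied by 
the query engine could be intersected with it.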
   
   All that is to say, I agree entirely with handling these concerns in 
arrow-rs as you suggest.



