tustvold commented on issue #1191: URL: https://github.com/apache/arrow-rs/issues/1191#issuecomment-1023646847
Definitely. Irrespective of how the pruning takes place - whether a scan mask as proposed here, information derived from a page index, or some other mechanism - there's definitely no point doing IO for pages that aren't going to be used. :+1:

My hope with the scan masks is to provide an API that allows query engines to express what rows they want, without having to worry about the mechanics of which pages to fetch and decode for those rows, what statistics are available, how the writer interpreted the ambiguous parquet specifications, etc. This would leave the implementation in arrow-rs free to choose the most efficient way to get the requested rows of data. As an added benefit, this approach would integrate well with the async plumbing I stubbed out in #1154, which needs to know ahead of time what data is going to be needed.

This scan mask approach is heavily geared to the use-case of IOx, where the page index is likely to not be very helpful, but the two are definitely complementary approaches. I'd imagine it is possible to construct the partial scan logic in such a way that it works well for both. Ultimately the page index is just a way to generate a low granularity scan mask :smile:

All that is to say, I agree entirely with handling these concerns in arrow-rs as you suggest.
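
To make the idea a little more concrete, here is a minimal sketch of what a scan mask could look like. Everything in it is hypothetical: `ScanMask`, `pages_to_fetch`, and `mask_from_pages` are illustrative names, not existing arrow-rs APIs, and a page is modelled simply as the range of row indices it covers.

```rust
use std::ops::Range;

/// Hypothetical scan mask: the rows a query engine wants, stored as
/// sorted, non-overlapping row ranges.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanMask {
    ranges: Vec<Range<usize>>,
}

impl ScanMask {
    /// Builds a mask from arbitrary row ranges, dropping empty ranges and
    /// merging any that overlap or touch.
    pub fn new(mut ranges: Vec<Range<usize>>) -> Self {
        ranges.retain(|r| r.start < r.end);
        ranges.sort_by_key(|r| r.start);

        let mut merged: Vec<Range<usize>> = Vec::with_capacity(ranges.len());
        for r in ranges {
            if let Some(last) = merged.last_mut() {
                if r.start <= last.end {
                    // Overlapping or adjacent: extend the previous range.
                    last.end = last.end.max(r.end);
                    continue;
                }
            }
            merged.push(r);
        }
        Self { ranges: merged }
    }

    /// The selected row ranges, sorted and non-overlapping.
    pub fn ranges(&self) -> &[Range<usize>] {
        &self.ranges
    }

    /// True if at least one selected row falls inside `rows`.
    pub fn intersects(&self, rows: &Range<usize>) -> bool {
        self.ranges
            .iter()
            .any(|r| r.start < rows.end && rows.start < r.end)
    }
}

/// Indices of the pages that contain at least one selected row. Only these
/// pages would need to be fetched and decoded.
pub fn pages_to_fetch(mask: &ScanMask, pages: &[Range<usize>]) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, page)| mask.intersects(page))
        .map(|(i, _)| i)
        .collect()
}

/// A page index viewed as a coarse scan mask: every row of each page that
/// survives pruning (`keep[i] == true`) is selected.
pub fn mask_from_pages(pages: &[Range<usize>], keep: &[bool]) -> ScanMask {
    ScanMask::new(
        pages
            .iter()
            .zip(keep)
            .filter(|(_, k)| **k)
            .map(|(page, _)| page.clone())
            .collect(),
    )
}
```

With pages covering rows `0..100`, `100..200`, and `200..300`, a mask built from `150..160`, `10..20`, and `15..30` only touches the first two pages, so the third never needs any IO. Going the other way, `mask_from_pages` shows the sense in which a page index yields a low granularity mask: it can only select or reject whole pages.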
