alamb commented on PR #7360: URL: https://github.com/apache/arrow-rs/pull/7360#issuecomment-2766609690
> > I think it is possible to implement this feature without modifing the parquet reader and using the currently available APIs
>
> I have tried to implement this in third-party libs, but arrow-rs lacks enough public APIs (for example, users can not construct `Sbbf` outside of `parquet`), also the related APIs is not convenient enough to be used in public at the moment.

You can certainly access and use `Sbbf` outside the parquet crate. For example, DataFusion does so to prune row groups and data pages here: https://github.com/apache/datafusion/blob/6d5e00ad3f8e53f7252cb1d3c72a6c7f28c1aed6/datafusion/datasource-parquet/src/row_group_filter.rs#L236-L235

What is the use case for constructing `Sbbf`? I think it would be fine to make that public in the crate.

> > That being said, as you show here it is non trivial to implement row group / page filtering.
>
> That's what I want to point out, this demand is general enough to lots of users, but it is not that easy to be realized, and also exposes lots of internal details,

Yes, indeed, it is not trivial to implement a fast parquet reader integrated with a query engine.

> if parquet contains a first-party `TableProvider` implementation, it is good to me.

What do you mean by "TableProvider"? If you are already using DataFusion, perhaps you can use the built-in parquet reader (`ListingTableProvider`), which already has all these optimizations.
