alamb commented on PR #7360: URL: https://github.com/apache/arrow-rs/pull/7360#issuecomment-2766763040
> > What is the use case for constructing Sbbf?
>
> I can not find the way to get `Sbbf` instances in the async read path of `parquet` crates, this only works with `SerializedRowGroupReader`, but it is synchronous, so I have to construct it manually from `bytes::Bytes`.

Perhaps you can propose an API to do so (perhaps on ParquetMetadataReader).

> I do not use datafusion (not yet), if there is a first-party scan method of parquet async reader with prediction/projection/limitation pushdown, that is what I need. I'd like to say `TableProvider` provides similar semantics to the above API, but I'm not sure it is the best choice to be the first-party implementation in `parquet`.

I agree implementing a table provider like interface in the parquet crate is likely not a good idea.

> > My biggest concern here is adding more code to maintain as part of this crate that may not be widely used
>
> Chroma(@HammadB) and also Tonbo both run into this issue.

If the issue is that the public API of the parquet-rs crate doesn't allow you to implement pushdowns, I agree we should extend the API to address whatever you are having trouble doing.

If the issue is that it is complex to implement parquet predicate pushdown, I am not sure that is a great fit for this crate because the details of implementing predicate pushdown vary significantly from system to system. For example:

1. What predicates are supported (do you support predicates like prefix matching, user defined functions, etc).
2. How do you evaluate predicates when there are multiple files (with potentially different but compatible schemas)
3. How do you evaluate predicates using information from an external metadata catalog (e.g. iceberg or similar)
4. How do you interleave fetching metadata, evaluating predicates, and scanning files

It isn't clear to me where to draw the line between predicate evaluation and a full query engine.
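As a rough illustration of why these choices are system specific, here is a minimal sketch of row group pruning from min/max statistics. Every name in it (`ColumnStats`, `Predicate`, `prune_row_groups`) is hypothetical and made up for this example; none of them is an API of the `parquet` crate, and a real system would have to answer the questions above (which predicates, which schemas, where the statistics come from) before settling on anything like this.

```rust
// Hypothetical types for illustration only; not part of the `parquet` crate.

/// Min/max statistics for a single i64 column in one row group.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
}

/// A deliberately tiny predicate language. Each system decides for itself
/// what belongs here (prefix matching, user defined functions, ...).
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Lt(i64),
    Gt(i64),
}

impl Predicate {
    /// Returns `false` only when the statistics prove that no row can match.
    /// A `true` result means the row group still has to be scanned.
    fn may_match(&self, stats: &ColumnStats) -> bool {
        match *self {
            Predicate::Eq(v) => stats.min <= v && v <= stats.max,
            Predicate::Lt(v) => stats.min < v,
            Predicate::Gt(v) => stats.max > v,
        }
    }
}

/// Returns the indices of the row groups that cannot be skipped.
fn prune_row_groups(stats: &[ColumnStats], predicate: Predicate) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| predicate.may_match(s))
        .map(|(i, _)| i)
        .collect()
}
```

With three row groups covering 0..=9, 10..=19 and 20..=29, `prune_row_groups(&stats, Predicate::Gt(15))` keeps row groups 1 and 2 and skips row group 0. Even this toy version hard-codes one column type and one source of statistics, which is the kind of decision that differs between systems.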
Maybe you and @HammadB can create some other crate (parquet-predicate-pushdown) implementing the specific pushdown APIs that you need.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
