wgtmac commented on issue #37559: URL: https://github.com/apache/arrow/issues/37559#issuecomment-1828931646
> > Finally I have got some time to complete the design doc drafted by @mapleFU: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM/.
>
> This proposes a number of reader APIs based on row ranges, but never says how row ranges are computed in the first place?

I agree with @emkornfield that there are many approaches to producing row ranges. AFAIK, many downstream projects have their own expression APIs and use only the Arrow layer of parquet-cpp (not the dataset layer). It is difficult to settle on a single way of producing row ranges for all downstream projects, but it is easy to agree on pushing row ranges down to the Parquet reader to achieve filtering. Therefore I leave each engine free to design its own logic for producing row ranges.

At the moment, I have marked filtering support in the Parquet dataset reader as a non-goal in the doc. My idea is to use expressions from Arrow compute to do something similar to what parquet-mr does: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L65-L104.
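
To make the idea of engine-side row range computation concrete, here is a minimal sketch (in Python, not parquet-cpp) of how an engine might derive row ranges from page-level min/max statistics, loosely in the spirit of parquet-mr's `ColumnIndexFilter`. `PageStats`, `ranges_for_greater_than`, and `intersect` are hypothetical names used only for illustration; they are not part of the Arrow or Parquet APIs or of the design doc, and a real engine would drive this from its own expression representation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Inclusive [first_row, last_row] within one row group.
RowRange = Tuple[int, int]


@dataclass
class PageStats:
    """Hypothetical per-page summary, modeled on Parquet's column/offset index."""
    first_row: int   # index of the page's first row in the row group
    num_rows: int
    min_value: int
    max_value: int


def ranges_for_greater_than(pages: List[PageStats], threshold: int) -> List[RowRange]:
    """Return row ranges of pages that may contain a value > threshold."""
    ranges: List[RowRange] = []
    for page in pages:
        if page.max_value <= threshold:
            continue  # no row in this page can match, so skip it
        first, last = page.first_row, page.first_row + page.num_rows - 1
        if ranges and ranges[-1][1] + 1 == first:
            ranges[-1] = (ranges[-1][0], last)  # merge with the adjacent range
        else:
            ranges.append((first, last))
    return ranges


def intersect(a: List[RowRange], b: List[RowRange]) -> List[RowRange]:
    """Intersect two sorted, non-overlapping range lists (logical AND)."""
    out: List[RowRange] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

The resulting list of ranges is what an engine would push down to the reader; the ranges are conservative, so rows inside them still need to be checked against the actual predicate.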
