wgtmac commented on issue #37559:
URL: https://github.com/apache/arrow/issues/37559#issuecomment-1828931646

   > > Finally I have got some time to complete the design doc drafted by @mapleFU:
   > > https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM/.
   >
   > This proposes a number of reader APIs based on row ranges, but never says
   > how row ranges are computed in the first place?
   
   I agree with @emkornfield that there are many approaches to produce row 
ranges.
   
   AFAIK, many downstream projects have their own expression APIs and use only
the arrow layer of parquet-cpp (not the dataset layer). It is difficult to
settle on a single way of producing row ranges that suits all downstream
projects, but it is easy to agree on pushing row ranges down to the parquet
reader to achieve filtering. Therefore I leave each engine free to design its
own logic for producing row ranges.
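
   As a rough illustration of what "pushing row ranges down" could mean on the
reader side, here is a minimal sketch. `RowRange` and `PagesToRead` are
hypothetical names made up for this example; they are not taken from the design
doc or from any existing parquet-cpp API. It assumes the engine hands over
sorted, non-overlapping ranges and that the first row index of each page is
known (for example from the offset index).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: a closed interval [from, to] of row indices
// within one row group.
struct RowRange {
  int64_t from;
  int64_t to;
};

// Returns the indices of the pages that overlap at least one requested range.
// Assumptions: page_first_rows is ascending, and ranges are sorted and
// non-overlapping. How the ranges were produced is up to the engine.
std::vector<size_t> PagesToRead(const std::vector<int64_t>& page_first_rows,
                                int64_t num_rows,
                                const std::vector<RowRange>& ranges) {
  std::vector<size_t> pages;
  size_t r = 0;
  for (size_t p = 0; p < page_first_rows.size(); ++p) {
    const int64_t page_begin = page_first_rows[p];
    const int64_t page_end =
        (p + 1 < page_first_rows.size() ? page_first_rows[p + 1] : num_rows) - 1;
    assert(page_begin <= page_end);
    // Drop ranges that end before this page starts.
    while (r < ranges.size() && ranges[r].to < page_begin) ++r;
    // The page is needed if the next remaining range starts inside it
    // or before its last row.
    if (r < ranges.size() && ranges[r].from <= page_end) pages.push_back(p);
  }
  return pages;
}
```

   With pages starting at rows 0, 100, 200 and 300 in a 400-row row group, the
ranges [50, 120] and [310, 320] would select pages 0, 1 and 3, so page 2 could
be skipped entirely.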
   
   At the moment, I have marked filtering support in the parquet dataset reader
as a non-goal in the doc. My idea is to use expressions from Arrow compute to do
something similar to what ColumnIndexFilter in parquet-mr does:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L65-L104.
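
   To make the idea concrete, below is a minimal sketch of deriving row ranges
from per-page min/max statistics for a single predicate `value > threshold`.
It is neither the parquet-mr code nor an Arrow compute API: `PageInfo`,
`RowRange` and `RowRangesForGreaterThan` are hypothetical names, and the values
are assumed to be plain int64 with no nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: a closed interval [from, to] of row indices.
struct RowRange {
  int64_t from;
  int64_t to;
};

// Hypothetical per-page summary, loosely modelled on the information kept in
// the Parquet column index (min/max) and offset index (first row of a page).
struct PageInfo {
  int64_t first_row;
  int64_t min_value;
  int64_t max_value;
};

// Keeps every page whose max exceeds the threshold and merges adjacent
// surviving pages into a single range. Pages must be ordered by first_row.
std::vector<RowRange> RowRangesForGreaterThan(const std::vector<PageInfo>& pages,
                                              int64_t num_rows,
                                              int64_t threshold) {
  std::vector<RowRange> out;
  for (size_t i = 0; i < pages.size(); ++i) {
    // No row in this page can satisfy value > threshold.
    if (pages[i].max_value <= threshold) continue;
    const int64_t from = pages[i].first_row;
    const int64_t to =
        (i + 1 < pages.size() ? pages[i + 1].first_row : num_rows) - 1;
    assert(from <= to);
    if (!out.empty() && out.back().to + 1 == from) {
      out.back().to = to;  // Contiguous with the previous range: extend it.
    } else {
      out.push_back({from, to});
    }
  }
  return out;
}
```

   The result is conservative: a selected page may still contain rows that do
not match, so the engine still has to evaluate the predicate on the rows it
reads back.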


