[GitHub] [arrow] Fokko commented on issue #33972: [Python] Remove redundant S3 call

via GitHub Fri, 03 Feb 2023 09:15:09 -0800


Fokko commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1416165705


   > We don't have a format-agnostic concept of "read the metadata but cache it 
for use later so you don't have to read it again".
   
   That's not a problem, as long as the metadata stays cached in the fragment. 
Seeking backwards from the end of the file to read the footer is rather 
expensive (in terms of time), so we would love to eliminate that call. I went 
through the code and was able to pass the metadata from the fragment down to 
the reader: 
https://github.com/apache/arrow/pull/34015
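
To illustrate the idea (this is not the code in the PR), here is a minimal pure-Python sketch. It relies on one fact about the format: a Parquet file ends with the footer, a 4-byte little-endian footer length, and the magic bytes `PAR1`, so finding the footer means reading from the end of the file. Everything else is an assumption made for illustration: `CountingFile`, `read_footer`, `Fragment` and `Reader` are hypothetical names, not Arrow classes, and the "footer" is an opaque byte string rather than real Thrift-encoded metadata.

```python
import io
import struct

MAGIC = b"PAR1"


class CountingFile:
    """Hypothetical stand-in for a remote file; counts tail reads (think S3 requests)."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.reads = 0

    def read_tail(self, nbytes):
        self.reads += 1
        self._buf.seek(-nbytes, io.SEEK_END)
        return self._buf.read(nbytes)


def read_footer(f):
    # File layout: ... | footer | 4-byte LE footer length | b"PAR1"
    tail = f.read_tail(8)
    (length,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet-style file")
    # Second read: fetch the footer itself, then drop the 8 trailing bytes.
    return f.read_tail(8 + length)[:length]


class Reader:
    """Reads the footer itself unless the caller hands over cached metadata."""

    def __init__(self, f, metadata=None):
        self.metadata = metadata if metadata is not None else read_footer(f)


class Fragment:
    """Caches the footer after the first access and passes it down to the reader."""

    def __init__(self, f):
        self._f = f
        self._metadata = None

    @property
    def metadata(self):
        if self._metadata is None:
            self._metadata = read_footer(self._f)
        return self._metadata

    def open_reader(self):
        return Reader(self._f, metadata=self.metadata)


# Build a tiny fake file with an opaque footer.
footer = b"fake-footer"
data = MAGIC + b"row-group-bytes" + footer + struct.pack("<I", len(footer)) + MAGIC

f = CountingFile(data)
fragment = Fragment(f)
fragment.metadata                # first access reads the footer from the tail
reader = fragment.open_reader()  # reuses the cached footer, no further reads
```

In this sketch, `f.reads` does not increase when `open_reader()` is called, whereas constructing `Reader(f)` without the cached metadata would go back to the file for the footer again.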
   
   > The simple ParquetFile interface for single files doesn't support 
filtering row groups with a filter, so that would be a step back from using 
`pq.read_table`?
   
   I agree, we need to have predicate pushdown 👍🏻 
   
   > Longer term, you can probably just specify a [custom evolution 
strategy](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/dataset.h#L254)
 (using parquet column IDs) and let pyarrow handle the expression conversion 
for you. Sadly, this feature is not yet ready (I'm working on it when I can. 🤞 
for 12.0.0)
   
   Let me know when something is ready, happy to test 👍🏻 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
