westonpace commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1414589158

   > We need to make projections, and we need to have the schema before loading 
the data. For example, if you have an Iceberg table and you rename a 
column, you don't want to rewrite your multi-petabyte table. Iceberg uses 
IDs to identify columns, so if you filter or project on that column, it 
will resolve to the old column name in files that were written before the rename.
   
   Ok, that helps.  In the short term I think you should use 
`pyarrow.parquet.ParquetFile`.  That's a direct binding to the parquet-cpp libs 
and won't use any of the dataset stuff.  We don't yet have a format-agnostic 
concept of "read the metadata once and cache it so you don't have to read it 
again".
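   To make the suggestion concrete, here's a minimal sketch (the file path is hypothetical): opening a `ParquetFile` reads the footer once and caches the metadata on the object, so you can inspect the schema and then project columns without re-reading it.

   ```python
   import os
   import tempfile
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Write a small example file (hypothetical path, just for the demo).
   path = os.path.join(tempfile.mkdtemp(), "example.parquet")
   pq.write_table(pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]}), path)

   # ParquetFile is a direct binding to parquet-cpp; the footer metadata is
   # read at open time and cached on the object.
   pf = pq.ParquetFile(path)
   print(pf.metadata.num_rows)   # footer metadata only, no data pages read
   print(pf.schema_arrow)        # Arrow schema derived from the cached footer

   # Project just the columns you need; the cached metadata is reused.
   table = pf.read(columns=["a"])
   ```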
   
   Longer term, you can probably just specify a [custom evolution 
strategy](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/dataset.h#L254)
 (using parquet column IDs) and let pyarrow handle the expression conversion 
for you.  Sadly, this feature is not yet ready (I'm working on it when I can; 
:crossed_fingers: for 12.0.0).
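   In the meantime, the ID-based resolution can be done by hand. This is a hedged sketch, not the dataset evolution API: it attaches an Iceberg-style column ID as Arrow field metadata (the `col_id` key, the `field`/`read_column` helpers, and the file names are all made up for illustration), relies on pyarrow round-tripping the Arrow schema through the parquet footer, and resolves a column by ID rather than name so files written before a rename still work.

   ```python
   import os
   import tempfile
   import pyarrow as pa
   import pyarrow.parquet as pq

   tmp = tempfile.mkdtemp()

   def field(name, typ, fid):
       # Hypothetical helper: attach an Iceberg-style column ID as Arrow field
       # metadata; pyarrow serializes the Arrow schema into the parquet
       # footer, so the metadata round-trips on read.
       return pa.field(name, typ, metadata={"col_id": str(fid)})

   # A file written *before* the rename: the column is still "user_name".
   old = pa.table([pa.array([1, 2])],
                  schema=pa.schema([field("user_name", pa.int64(), 1)]))
   pq.write_table(old, os.path.join(tmp, "old.parquet"))

   # The current table schema *after* the rename: same ID, new name.
   current = pa.schema([field("username", pa.int64(), 1)])

   def read_column(path, current_field):
       """Resolve a column by ID, then rename it to the current name."""
       pf = pq.ParquetFile(path)
       want = current_field.metadata[b"col_id"]
       for f in pf.schema_arrow:
           if f.metadata and f.metadata.get(b"col_id") == want:
               return pf.read(columns=[f.name]).rename_columns(
                   [current_field.name])
       raise KeyError("column id not found in file")

   t = read_column(os.path.join(tmp, "old.parquet"),
                   current.field("username"))
   ```

   The same lookup works unchanged on files written after the rename, since it keys on the ID rather than the name.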
