westonpace commented on issue #33972: URL: https://github.com/apache/arrow/issues/33972#issuecomment-1414589158
> We need to make projections, and we need to have the schema before loading the data. For example, if you have an Iceberg table and you rename a column, you don't want to rewrite your multi-petabyte table. Iceberg uses IDs to identify columns, so if you filter or project on that column, it will select the old column name in the files that were written before the rename.

OK, that helps. In the short term I think you should use `pyarrow.parquet.ParquetFile`. That's a direct binding to the parquet-cpp libraries and won't use any of the dataset machinery. We don't have a format-agnostic concept of "read the metadata once and cache it for later use so you don't have to read it again".

Longer term, you can probably just specify a [custom evolution strategy](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/dataset.h#L254) (using parquet column IDs) and let pyarrow handle the expression conversion for you. Sadly, this feature is not yet ready (I'm working on it when I can. :crossed_fingers: for 12.0.0)
