Fokko commented on issue #6973: URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1451550761
@alaturqua thanks for raising this issue. Here are some steps on how I would add ORC support.

First I would refactor this code:

https://github.com/apache/iceberg/blob/35692a360ffd2d1b9107ec6018f9fd97fda04671/python/pyiceberg/io/pyarrow.py#L484-L519

by removing the `pq` package from there and reading in a PyArrow fragment instead:

```python
import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

# Placeholders for illustration: substitute your own filesystem and file path
fs = S3FileSystem()
path = "bucket/path/to/data-file.parquet"
parquet_format = ds.ParquetFileFormat()

with fs.open_input_file(path) as fin:
    # First get the fragment
    fragment = parquet_format.make_fragment(fin, None)
    print(f"Schema: {fragment.physical_schema}")
    arrow_table = ds.Scanner.from_fragment(
        fragment=fragment
    ).to_table()
```

We need to add logic that decides the `Format` based on the extension of the file, so that we can read in different fragments:

- https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html
- https://arrow.apache.org/docs/python/generated/pyarrow.dataset.OrcFileFormat.html

The fragment exposes a `physical_schema` property that returns the file's schema as a PyArrow schema.

Note: for Parquet we currently fetch the Iceberg schema in JSON format from the file metadata. https://github.com/apache/iceberg/issues/6505 will provide the logic to convert a PyArrow schema into an Iceberg schema.
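
The extension-based format selection mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: `format_name_for`, `read_fragment`, and the `_EXTENSION_TO_FORMAT` mapping are hypothetical names that do not exist in pyiceberg, and it assumes data files carry a `.parquet` or `.orc` extension and that PyArrow was built with ORC support.

```python
import os

# Hypothetical mapping from file extension to a format name (not part of pyiceberg)
_EXTENSION_TO_FORMAT = {".parquet": "parquet", ".orc": "orc"}


def format_name_for(path: str) -> str:
    """Return "parquet" or "orc" based on the file extension of `path`."""
    _, ext = os.path.splitext(path)
    try:
        return _EXTENSION_TO_FORMAT[ext.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file extension {ext!r} for {path}") from None


def read_fragment(fs, path: str):
    """Read one data file into an Arrow table, choosing the format by extension.

    `fs` is assumed to be a pyarrow.fs.FileSystem instance.
    """
    # Imported lazily so that format_name_for can be used without PyArrow installed
    import pyarrow.dataset as ds

    if format_name_for(path) == "parquet":
        file_format = ds.ParquetFileFormat()
    else:
        file_format = ds.OrcFileFormat()

    with fs.open_input_file(path) as fin:
        fragment = file_format.make_fragment(fin, None)
        return ds.Scanner.from_fragment(fragment=fragment).to_table()
```

Keeping the extension lookup separate from the read makes it easy to unit-test the decision without touching a filesystem, and adding another format later would only mean extending the mapping and the branch in `read_fragment`.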
