[GitHub] [iceberg] Fokko commented on issue #6973: PyIceberg: ORC file format support

via GitHub Thu, 02 Mar 2023 01:21:42 -0800


Fokko commented on issue #6973:
URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1451550761


   @alaturqua thanks for raising this issue.
   
   Here are the steps I would take to add ORC support. First, I would refactor this code:
   
https://github.com/apache/iceberg/blob/35692a360ffd2d1b9107ec6018f9fd97fda04671/python/pyiceberg/io/pyarrow.py#L484-L519
   
   That means removing the `pq` package from there and reading in a PyArrow 
fragment instead:
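   ```python
   import pyarrow.dataset as ds
   from pyarrow.fs import S3FileSystem
   
   # Assumed setup, not part of the linked code: `fs` and `path` are placeholders.
   fs = S3FileSystem()  # picks up credentials from the environment
   path = "bucket/path/to/data.parquet"  # hypothetical object path
   parquet_format = ds.ParquetFileFormat()
   
   with fs.open_input_file(path) as fin:
       # First get the fragment
       fragment = parquet_format.make_fragment(fin, None)
       print(f"Schema: {fragment.physical_schema}")
       # Then scan the fragment into an Arrow table
       arrow_table = ds.Scanner.from_fragment(
           fragment=fragment
       ).to_table()
   ```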
   
   We need to add logic to pick the `Format` based on the file extension, so 
that we can read in different fragments:
   - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html
   - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.OrcFileFormat.html
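
As a rough sketch of that extension-based choice (not the actual PyIceberg implementation): the helper names `format_name_for` and `read_table` and the extension mapping are made up for illustration, and `fs` is assumed to be a `pyarrow.fs.FileSystem`. Reading ORC this way also assumes a PyArrow build with ORC support.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a format name.
_FORMAT_BY_EXTENSION = {".parquet": "parquet", ".orc": "orc"}


def format_name_for(path: str) -> str:
    """Return the format name for a data file, based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return _FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {path}") from None


def read_table(fs, path: str):
    """Read one data file into an Arrow table through a dataset fragment."""
    # Imported lazily so the extension logic above works without pyarrow.
    import pyarrow.dataset as ds

    formats = {"parquet": ds.ParquetFileFormat, "orc": ds.OrcFileFormat}
    file_format = formats[format_name_for(path)]()
    with fs.open_input_file(path) as fin:
        # Same fragment-based read as before, now independent of the format.
        fragment = file_format.make_fragment(fin, None)
        return ds.Scanner.from_fragment(fragment=fragment).to_table()
```

Keeping the extension lookup separate from the read makes it easy to add another format later without touching the fragment code.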
   
   The fragment exposes a `physical_schema` property that returns the file's 
schema as a PyArrow schema.
   
   Note: For Parquet, we currently fetch the Iceberg schema in JSON format from 
the metadata. https://github.com/apache/iceberg/issues/6505 will 
provide the logic to convert a PyArrow schema into an Iceberg schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

