[GitHub] [iceberg] JonasJ-ap commented on issue #6973: PyIceberg: ORC file format support

via GitHub Sat, 11 Mar 2023 20:31:57 -0800


JonasJ-ap commented on issue #6973:
URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1465090060


   I would like to share a quick test of reading field metadata with pyarrow 
from an ORC file (taken from an Iceberg table created on AWS Athena engine v3). 
The `pa.Schema` of the ORC file does not seem to contain any field-id 
information, even though the IDs do exist in the ORC file:
   
   I first used `orc-tools` to inspect the metadata of the ORC file and found 
that the field ID is stored in the `iceberg.id` attribute:
   ```zsh
   orc-tools meta /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
   log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
   log4j:WARN Please initialize the log4j system properly.
   log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
   Processing data file /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc [length: 1104]
   Structure for /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
   File Version: 0.12 with TRINO_ORIGINAL by Trino
   Rows: 1
   Compression: ZSTD
   Compression size: 262144
   Calendar: Julian/Gregorian
   Type: struct<user_id:int,user_name:string,status:string,cost:double,quantity:int,quantity_big:int,creation_date:date,created_at:timestamp,inserted_at:timestamp>
   Attributes on root.user_id
     iceberg.id: 1
     iceberg.required: false
   Attributes on root.user_name
     iceberg.id: 2
     iceberg.required: false
   Attributes on root.status
     iceberg.id: 3
     iceberg.required: false
   Attributes on root.cost
     iceberg.id: 4
     iceberg.required: false
   Attributes on root.quantity
     iceberg.id: 5
     iceberg.required: false
   ```
   
   However, when I used an approach similar to the one in #7033 to get the 
pyarrow schema, the field-id information was lost:
   ```python
   import pyarrow.dataset as ds
   from pyarrow.fs import LocalFileSystem

   orc_test_path = "/Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc"
   fs = LocalFileSystem()
   with fs.open_input_file(orc_test_path) as f:
       # Alternative (needs `import pyarrow.orc as po`):
       # print(po.ORCFile(f).read().schema)
       # Print the schema that pyarrow reports for the ORC file
       print(ds.OrcFileFormat().make_fragment(f).physical_schema)
   ```
   ```zsh
   user_id: int32
   user_name: string
   status: string
   cost: double
   quantity: int32
   quantity_big: int32
   creation_date: date32[day]
   created_at: timestamp[ns]
   inserted_at: timestamp[ns]
   -- schema metadata --
   presto_query_id: '20230311_203248_00007_2vuin'
   trino.writer.version: '0.215-16024-g6eec71f'
   presto_version: '0.215-16024-g6eec71f'
   ```
   A pyarrow schema that carries the field-id information should look like this:
   ```zsh
   user_id: int32
     -- field metadata --
     PARQUET:field_id: '1'
   user_name: string
     -- field metadata --
     PARQUET:field_id: '2'
   status: string
     -- field metadata --
     PARQUET:field_id: '3'
   ...
   ```
   
   It seems that even with #6997 merged, we still cannot fully support the ORC 
file format if we use pyarrow to read it. The name mapping feature may need to 
be implemented here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

