JonasJ-ap commented on issue #6973: URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1465090060
I would like to share a quick test of reading field metadata from an ORC file (from an Iceberg table created on AWS Athena engine v3) with pyarrow. It seems that the `pa.Schema` of the ORC file does not contain any field ID information, even though the IDs do exist in the ORC file.

I first used `orc-tools` to inspect the metadata of the ORC file and found that the field ID is stored in the `iceberg.id` attribute:

```zsh
orc-tools meta /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc [length: 1104]
Structure for /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
File Version: 0.12 with TRINO_ORIGINAL by Trino
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<user_id:int,user_name:string,status:string,cost:double,quantity:int,quantity_big:int,creation_date:date,created_at:timestamp,inserted_at:timestamp>

Attributes on root.user_id
  iceberg.id: 1
  iceberg.required: false
Attributes on root.user_name
  iceberg.id: 2
  iceberg.required: false
Attributes on root.status
  iceberg.id: 3
  iceberg.required: false
Attributes on root.cost
  iceberg.id: 4
  iceberg.required: false
Attributes on root.quantity
  iceberg.id: 5
  iceberg.required: false
```

However, when I used an approach similar to the one in #7033 to get the pyarrow schema, the field ID info was lost:

```python
import pyarrow.dataset as ds
import pyarrow.orc as po  # only needed for the commented-out alternative below
from pyarrow.fs import LocalFileSystem

orc_test_path = "/Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc"
fs = LocalFileSystem()
with fs.open_input_file(orc_test_path) as f:
    # Alternative: print(po.ORCFile(f).read().schema)
    print(ds.OrcFileFormat().make_fragment(f).physical_schema)
```

```zsh
user_id: int32
user_name: string
status: string
cost: double
quantity: int32
quantity_big: int32
creation_date: date32[day]
created_at: timestamp[ns]
inserted_at: timestamp[ns]
-- schema metadata --
presto_query_id: '20230311_203248_00007_2vuin'
trino.writer.version: '0.215-16024-g6eec71f'
presto_version: '0.215-16024-g6eec71f'
```

A pyarrow schema containing the field ID info should look like:

```zsh
user_id: int32
  -- field metadata --
  PARQUET:field_id: '1'
user_name: string
  -- field metadata --
  PARQUET:field_id: '2'
status: string
  -- field metadata --
  PARQUET:field_id: '3'
...
```

It seems that even with #6997 merged, we still cannot fully support the ORC file format if we use pyarrow to read it. The name mapping feature may need to be implemented here.
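To make the name mapping idea concrete, here is a minimal sketch in plain Python. It is not PyIceberg's implementation. It assumes the `orc-tools meta` text output puts each `Attributes on root.<column>` header and each `iceberg.id` attribute on its own line, and it only handles the top-level columns from the output above. The helper names `parse_iceberg_ids` and `field_metadata_by_name` are hypothetical.

```python
import re

# Excerpt of the attribute section from the `orc-tools meta` output above
# (assumed layout: one header or attribute per line).
ORC_META_EXCERPT = """\
Attributes on root.user_id
  iceberg.id: 1
  iceberg.required: false
Attributes on root.user_name
  iceberg.id: 2
  iceberg.required: false
Attributes on root.status
  iceberg.id: 3
  iceberg.required: false
"""


def parse_iceberg_ids(meta_text):
    """Collect {column name: iceberg.id} from `orc-tools meta` text output."""
    ids = {}
    current = None
    for line in meta_text.splitlines():
        header = re.match(r"\s*Attributes on root\.(\S+)", line)
        if header:
            # Remember which column the following attributes belong to.
            current = header.group(1)
            continue
        attr = re.match(r"\s*iceberg\.id:\s*(\d+)", line)
        if attr and current is not None:
            ids[current] = int(attr.group(1))
    return ids


def field_metadata_by_name(column_names, name_to_id):
    """Hypothetical name-mapping step: look up each column's ID by name.

    Returns per-field metadata in the `PARQUET:field_id` form shown above.
    Columns without a mapping entry are skipped.
    """
    return {
        name: {"PARQUET:field_id": str(name_to_id[name])}
        for name in column_names
        if name in name_to_id
    }


name_to_id = parse_iceberg_ids(ORC_META_EXCERPT)
# `quantity_big` has no entry in the excerpt, so it gets no metadata.
columns = ["user_id", "user_name", "status", "quantity_big"]
metadata = field_metadata_by_name(columns, name_to_id)
print(metadata)
```

With pyarrow, each of these dicts could then be attached to the matching field through `pa.Field.with_metadata` before the schema is converted. In a real name mapping the name-to-ID table would more likely come from the Iceberg table metadata than from parsing tool output.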
