[GitHub] [iceberg] Fokko commented on a diff in pull request #6997: Python: Infer Iceberg schema from the Parquet file

via GitHub Sat, 11 Mar 2023 14:37:55 -0800


Fokko commented on code in PR #6997:
URL: https://github.com/apache/iceberg/pull/6997#discussion_r1133156361



##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -356,14 +366,19 @@ def field(self, field: NestedField, field_result: 
pa.DataType) -> pa.Field:
             name=field.name,
             type=field_result,
             nullable=field.optional,
-            metadata={"doc": field.doc, "id": str(field.field_id)} if 
field.doc else {},
+            metadata={PYTHON_DOC.decode(): field.doc, 
PYTHON_FIELD_ID.decode(): str(field.field_id)}
+            if field.doc
+            else {PYTHON_FIELD_ID.decode(): str(field.field_id)},

Review Comment:
   I would probably get rid of the `PARQUET:` prefix when reading the data. It 
looks like this is something that PyArrow introduces. But when we write the 
fields again, we definitely don't want to have that prefix there, otherwise 
next time it would read as `PARQUET:PARQUET:field_id` etc :)
   
   I checked with duckdb, and it shows just `field_id`:
   ```
   ➜  Desktop duckdb
   v0.7.1 b00b93f0b1
   Enter ".help" for usage hints.
   Connected to a transient in-memory database.
   Use ".open FILENAME" to reopen on a persistent database.
   D SELECT * FROM 
PARQUET_SCHEMA('/Users/fokkodriesprong/Desktop/00003-4-5e18fe8b-532b-42ec-bdb9-0c7ba7078155-00001.parquet');
   
┌──────────────────────┬──────────────────────┬────────────┬─────────────┬───┬───────┬───────────┬──────────┬──────────────────────┐
   │      file_name       │         name         │    type    │ type_length │ … 
│ scale │ precision │ field_id │     logical_type     │
   │       varchar        │       varchar        │  varchar   │   varchar   │   
│ int64 │   int64   │  int64   │       varchar        │
   
├──────────────────────┼──────────────────────┼────────────┼─────────────┼───┼───────┼───────────┼──────────┼──────────────────────┤
   │ /Users/fokkodriesp…  │ table                │ BOOLEAN    │ 0           │ … 
│     0 │         0 │        0 │                      │
   │ /Users/fokkodriesp…  │ VendorID             │ INT64      │ 0           │ … 
│     0 │         0 │        1 │                      │
   │ /Users/fokkodriesp…  │ tpep_pickup_datetime │ INT64      │ 0           │ … 
│     0 │         0 │        2 │ TimestampType(isAd…  │
   │ /Users/fokkodriesp…  │ tpep_dropoff_datet…  │ INT64      │ 0           │ … 
│     0 │         0 │        3 │ TimestampType(isAd…  │
   │ /Users/fokkodriesp…  │ passenger_count      │ DOUBLE     │ 0           │ … 
│     0 │         0 │        4 │                      │
   │ /Users/fokkodriesp…  │ trip_distance        │ DOUBLE     │ 0           │ … 
│     0 │         0 │        5 │                      │
   │ /Users/fokkodriesp…  │ RatecodeID           │ DOUBLE     │ 0           │ … 
│     0 │         0 │        6 │                      │
   │ /Users/fokkodriesp…  │ store_and_fwd_flag   │ BYTE_ARRAY │ 0           │ … 
│     0 │         0 │        7 │ StringType()         │
   │ /Users/fokkodriesp…  │ PULocationID         │ INT64      │ 0           │ … 
│     0 │         0 │        8 │                      │
   │ /Users/fokkodriesp…  │ DOLocationID         │ INT64      │ 0           │ … 
│     0 │         0 │        9 │                      │
   │ /Users/fokkodriesp…  │ payment_type         │ INT64      │ 0           │ … 
│     0 │         0 │       10 │                      │
   │ /Users/fokkodriesp…  │ fare_amount          │ DOUBLE     │ 0           │ … 
│     0 │         0 │       11 │                      │
   │ /Users/fokkodriesp…  │ extra                │ DOUBLE     │ 0           │ … 
│     0 │         0 │       12 │                      │
   │ /Users/fokkodriesp…  │ mta_tax              │ DOUBLE     │ 0           │ … 
│     0 │         0 │       13 │                      │
   │ /Users/fokkodriesp…  │ tip_amount           │ DOUBLE     │ 0           │ … 
│     0 │         0 │       14 │                      │
   │ /Users/fokkodriesp…  │ tolls_amount         │ DOUBLE     │ 0           │ … 
│     0 │         0 │       15 │                      │
   │ /Users/fokkodriesp…  │ improvement_surcha…  │ DOUBLE     │ 0           │ … 
│     0 │         0 │       16 │                      │
   │ /Users/fokkodriesp…  │ total_amount         │ DOUBLE     │ 0           │ … 
│     0 │         0 │       17 │                      │
   │ /Users/fokkodriesp…  │ congestion_surcharge │ DOUBLE     │ 0           │ … 
│     0 │         0 │       18 │                      │
   │ /Users/fokkodriesp…  │ airport_fee          │ DOUBLE     │ 0           │ … 
│     0 │         0 │       19 │                      │
   
├──────────────────────┴──────────────────────┴────────────┴─────────────┴───┴───────┴───────────┴──────────┴──────────────────────┤
   │ 20 rows                                                                    
                                 11 columns (8 shown) │
   
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
   ```
   
   I also double checked with orc, and there it is 
[`field.id`](https://github.com/apache/iceberg/blob/bc650a1cd586b9c2f8063851e6d583533435f31e/orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java#L68):
   
   ```
   ➜  Desktop orc-tools meta 
00006-220-98bb5767-2804-4986-bc98-f48e94365b84-00001.orc     
   log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.util.Shell).
   log4j:WARN Please initialize the log4j system properly.
   log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for 
more info.
   Processing data file 
00006-220-98bb5767-2804-4986-bc98-f48e94365b84-00001.orc [length: 2229]
   Structure for 00006-220-98bb5767-2804-4986-bc98-f48e94365b84-00001.orc
   File Version: 0.12 with ORC_14 by ORC Java 1.7.8
   Rows: 7
   Compression: ZLIB
   Compression size: 262144
   Calendar: Julian/Gregorian
   Type: struct<VendorID:bigint,tpep_pickup_datetime:timestamp with local time 
zone,tpep_dropoff_datetime:timestamp with local time 
zone,passenger_count:double,trip_distance:double,RatecodeID:double,store_and_fwd_flag:string,PULocationID:bigint,DOLocationID:bigint,payment_type:bigint,fare_amount:double,extra:double,mta_tax:double,tip_amount:double,tolls_amount:double,improvement_surcharge:double,total_amount:double,congestion_surcharge:double,airport_fee:double>
   Attributes on root.VendorID
     iceberg.id: 1
     iceberg.long-type: LONG
     iceberg.required: false
   Attributes on root.tpep_pickup_datetime
     iceberg.id: 2
     iceberg.required: false
   Attributes on root.tpep_dropoff_datetime
     iceberg.id: 3
     iceberg.required: false
   Attributes on root.passenger_count
     iceberg.id: 4
     iceberg.required: false
   Attributes on root.trip_distance
     iceberg.id: 5
     iceberg.required: false
   Attributes on root.RatecodeID
     iceberg.id: 6
     iceberg.required: false
   Attributes on root.store_and_fwd_flag
     iceberg.id: 7
     iceberg.required: false
   Attributes on root.PULocationID
     iceberg.id: 8
     iceberg.long-type: LONG
     iceberg.required: false
   Attributes on root.DOLocationID
     iceberg.id: 9
     iceberg.long-type: LONG
     iceberg.required: false
   Attributes on root.payment_type
     iceberg.id: 10
     iceberg.long-type: LONG
     iceberg.required: false
   Attributes on root.fare_amount
     iceberg.id: 11
     iceberg.required: false
   Attributes on root.extra
     iceberg.id: 12
     iceberg.required: false
   Attributes on root.mta_tax
     iceberg.id: 13
     iceberg.required: false
   Attributes on root.tip_amount
     iceberg.id: 14
     iceberg.required: false
   Attributes on root.tolls_amount
     iceberg.id: 15
     iceberg.required: false
   Attributes on root.improvement_surcharge
     iceberg.id: 16
     iceberg.required: false
   Attributes on root.total_amount
     iceberg.id: 17
     iceberg.required: false
   Attributes on root.congestion_surcharge
     iceberg.id: 18
     iceberg.required: false
   Attributes on root.airport_fee
     iceberg.id: 19
     iceberg.required: false
   ```
   
   So probably we need to be flexible with the keys anyway. Looking at Ryan's 
original comment: 
https://github.com/apache/iceberg/pull/6997#discussion_r1125740765
   
   I think what he means is that the first check if there are IDs, if so, we 
use the fields that have an ID. If there are no IDs at all and the PyArrow to 
Iceberg visitor returns an empty schema, then we introduce the name mapping 
(but maybe we should just leave that out of scope for this PR?).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] Fokko commented on a diff in pull request #6997: Python: Infer Iceberg schema from the Parquet file

Reply via email to