Fokko commented on code in PR #6997:
URL: https://github.com/apache/iceberg/pull/6997#discussion_r1133156361
##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -356,14 +366,19 @@ def field(self, field: NestedField, field_result:
pa.DataType) -> pa.Field:
name=field.name,
type=field_result,
nullable=field.optional,
- metadata={"doc": field.doc, "id": str(field.field_id)} if
field.doc else {},
+ metadata={PYTHON_DOC.decode(): field.doc,
PYTHON_FIELD_ID.decode(): str(field.field_id)}
+ if field.doc
+ else {PYTHON_FIELD_ID.decode(): str(field.field_id)},
Review Comment:
I would probably get rid of the `PARQUET:` prefix when reading the data. It
looks like this is something that PyArrow introduces. But when we write the
fields again, we definitely don't want to have that prefix there, otherwise
next time it would read as `PARQUET:PARQUET:field_id` etc :)
I checked with duckdb, and it shows just `field_id`:
```
➜ Desktop duckdb
v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SELECT * FROM
PARQUET_SCHEMA('/Users/fokkodriesprong/Desktop/00003-4-5e18fe8b-532b-42ec-bdb9-0c7ba7078155-00001.parquet');
┌──────────────────────┬──────────────────────┬────────────┬─────────────┬───┬───────┬───────────┬──────────┬──────────────────────┐
│ file_name │ name │ type │ type_length │ …
│ scale │ precision │ field_id │ logical_type │
│ varchar │ varchar │ varchar │ varchar │
│ int64 │ int64 │ int64 │ varchar │
├──────────────────────┼──────────────────────┼────────────┼─────────────┼───┼───────┼───────────┼──────────┼──────────────────────┤
│ /Users/fokkodriesp… │ table │ BOOLEAN │ 0 │ …
│ 0 │ 0 │ 0 │ │
│ /Users/fokkodriesp… │ VendorID │ INT64 │ 0 │ …
│ 0 │ 0 │ 1 │ │
│ /Users/fokkodriesp… │ tpep_pickup_datetime │ INT64 │ 0 │ …
│ 0 │ 0 │ 2 │ TimestampType(isAd… │
│ /Users/fokkodriesp… │ tpep_dropoff_datet… │ INT64 │ 0 │ …
│ 0 │ 0 │ 3 │ TimestampType(isAd… │
│ /Users/fokkodriesp… │ passenger_count │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 4 │ │
│ /Users/fokkodriesp… │ trip_distance │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 5 │ │
│ /Users/fokkodriesp… │ RatecodeID │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 6 │ │
│ /Users/fokkodriesp… │ store_and_fwd_flag │ BYTE_ARRAY │ 0 │ …
│ 0 │ 0 │ 7 │ StringType() │
│ /Users/fokkodriesp… │ PULocationID │ INT64 │ 0 │ …
│ 0 │ 0 │ 8 │ │
│ /Users/fokkodriesp… │ DOLocationID │ INT64 │ 0 │ …
│ 0 │ 0 │ 9 │ │
│ /Users/fokkodriesp… │ payment_type │ INT64 │ 0 │ …
│ 0 │ 0 │ 10 │ │
│ /Users/fokkodriesp… │ fare_amount │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 11 │ │
│ /Users/fokkodriesp… │ extra │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 12 │ │
│ /Users/fokkodriesp… │ mta_tax │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 13 │ │
│ /Users/fokkodriesp… │ tip_amount │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 14 │ │
│ /Users/fokkodriesp… │ tolls_amount │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 15 │ │
│ /Users/fokkodriesp… │ improvement_surcha… │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 16 │ │
│ /Users/fokkodriesp… │ total_amount │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 17 │ │
│ /Users/fokkodriesp… │ congestion_surcharge │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 18 │ │
│ /Users/fokkodriesp… │ airport_fee │ DOUBLE │ 0 │ …
│ 0 │ 0 │ 19 │ │
├──────────────────────┴──────────────────────┴────────────┴─────────────┴───┴───────┴───────────┴──────────┴──────────────────────┤
│ 20 rows
11 columns (8 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```
I also double checked with orc, and there it is
[`field.id`](https://github.com/apache/iceberg/blob/bc650a1cd586b9c2f8063851e6d583533435f31e/orc/src/main/java/org/apache/iceberg/orc/ORCSchemaUtil.java#L68):
```
➜ Desktop orc-tools meta
00006-220-98bb5767-2804-4986-bc98-f48e94365b84-00001.orc
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
Processing data file
00006-220-98bb5767-2804-4986-bc98-f48e94365b84-00001.orc [length: 2229]
Structure for 00006-220-98bb5767-2804-4986-bc98-f48e94365b84-00001.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.8
Rows: 7
Compression: ZLIB
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<VendorID:bigint,tpep_pickup_datetime:timestamp with local time
zone,tpep_dropoff_datetime:timestamp with local time
zone,passenger_count:double,trip_distance:double,RatecodeID:double,store_and_fwd_flag:string,PULocationID:bigint,DOLocationID:bigint,payment_type:bigint,fare_amount:double,extra:double,mta_tax:double,tip_amount:double,tolls_amount:double,improvement_surcharge:double,total_amount:double,congestion_surcharge:double,airport_fee:double>
Attributes on root.VendorID
iceberg.id: 1
iceberg.long-type: LONG
iceberg.required: false
Attributes on root.tpep_pickup_datetime
iceberg.id: 2
iceberg.required: false
Attributes on root.tpep_dropoff_datetime
iceberg.id: 3
iceberg.required: false
Attributes on root.passenger_count
iceberg.id: 4
iceberg.required: false
Attributes on root.trip_distance
iceberg.id: 5
iceberg.required: false
Attributes on root.RatecodeID
iceberg.id: 6
iceberg.required: false
Attributes on root.store_and_fwd_flag
iceberg.id: 7
iceberg.required: false
Attributes on root.PULocationID
iceberg.id: 8
iceberg.long-type: LONG
iceberg.required: false
Attributes on root.DOLocationID
iceberg.id: 9
iceberg.long-type: LONG
iceberg.required: false
Attributes on root.payment_type
iceberg.id: 10
iceberg.long-type: LONG
iceberg.required: false
Attributes on root.fare_amount
iceberg.id: 11
iceberg.required: false
Attributes on root.extra
iceberg.id: 12
iceberg.required: false
Attributes on root.mta_tax
iceberg.id: 13
iceberg.required: false
Attributes on root.tip_amount
iceberg.id: 14
iceberg.required: false
Attributes on root.tolls_amount
iceberg.id: 15
iceberg.required: false
Attributes on root.improvement_surcharge
iceberg.id: 16
iceberg.required: false
Attributes on root.total_amount
iceberg.id: 17
iceberg.required: false
Attributes on root.congestion_surcharge
iceberg.id: 18
iceberg.required: false
Attributes on root.airport_fee
iceberg.id: 19
iceberg.required: false
```
So probably we need to be flexible with the keys anyway. Looking at Ryan's
original comment:
https://github.com/apache/iceberg/pull/6997#discussion_r1125740765
I think what he means is that the first check if there are IDs, if so, we
use the fields that have an ID. If there are no IDs at all and the PyArrow to
Iceberg visitor returns an empty schema, then we introduce the name mapping
(but maybe we should just leave that out of scope for this PR?).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]