[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #6997: Python: Infer Iceberg schema from the Parquet file

via GitHub Sat, 11 Mar 2023 20:00:26 -0800


JonasJ-ap commented on code in PR #6997:
URL: https://github.com/apache/iceberg/pull/6997#discussion_r1133182512



##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -356,14 +366,19 @@ def field(self, field: NestedField, field_result: pa.DataType) -> pa.Field:
             name=field.name,
             type=field_result,
             nullable=field.optional,
-            metadata={"doc": field.doc, "id": str(field.field_id)} if field.doc else {},
+            metadata={PYTHON_DOC.decode(): field.doc, PYTHON_FIELD_ID.decode(): str(field.field_id)}
+            if field.doc
+            else {PYTHON_FIELD_ID.decode(): str(field.field_id)},

Review Comment:
   Thank you very much for such a detailed explanation. I was thinking of the "pre-processing PyArrow schema" approach mentioned by Ryan, but now I see the problem with the prefix, and I agree that we should leave name mapping for another PR.
   
   I refactored `_get_field_id_and_doc` to search for a key ending with `id` so that both Parquet and ORC can be handled. I also checked the Avro format and found that the related key name is `field-id`, so it fits as well:
   ```python
   (base) ➜  user_id_bucket=2 avro-tools getschema 20230312_025043_00060_6msjy-bf8f6d3b-7e93-4f25-8a77-d0c183d6771a.avro
   23/03/11 21:55:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   {
     "type" : "record",
     "name" : "table",
     "fields" : [ {
       "name" : "user_id",
       "type" : [ "null", "int" ],
       "default" : null,
       "field-id" : 1
     }
   ```
   Please let me know if something like `key.decode().endswith(FIELD_ID)` is too permissive in this case. (I'm assuming that the field metadata in Iceberg data files contains only a few items and that only one key is associated with the ID.)
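
To make the question concrete, here is a minimal sketch of the suffix match, not the actual code in this PR. The helper name `find_field_id_and_doc`, the constant values, and the sample keys `PARQUET:field_id`, `iceberg.id`, and `parent_id` are assumptions for illustration; the metadata is modeled as a plain `dict` with `bytes` keys and values, which is the shape PyArrow uses for field metadata. It raises when more than one key matches, which is one possible guard against the match being too permissive.

```python
from typing import Dict, Optional, Tuple

# Assumed suffixes for illustration; the real constants may differ.
FIELD_ID = "id"
FIELD_DOC = "doc"


def find_field_id_and_doc(
    metadata: Optional[Dict[bytes, bytes]],
) -> Tuple[Optional[int], Optional[str]]:
    """Return (field_id, doc) found by suffix-matching the metadata keys."""
    metadata = metadata or {}  # a field may carry no metadata at all

    # Collect every key whose decoded name ends with the ID suffix.
    id_keys = [key for key in metadata if key.decode().endswith(FIELD_ID)]
    if len(id_keys) > 1:
        # The suffix match is ambiguous, so refuse to guess.
        raise ValueError(f"Ambiguous field-id keys: {sorted(id_keys)}")
    field_id = int(metadata[id_keys[0]].decode()) if id_keys else None

    # The doc string is optional; take the first matching key if present.
    doc_keys = [key for key in metadata if key.decode().endswith(FIELD_DOC)]
    doc = metadata[doc_keys[0]].decode() if doc_keys else None

    return field_id, doc
```

Under these assumptions, `{b"PARQUET:field_id": b"1"}`, `{b"iceberg.id": b"2"}`, and `{b"field-id": b"3"}` all resolve to their IDs, while metadata that also carries an unrelated key such as `b"parent_id"` is rejected instead of silently picking one. Matching against an explicit set of known keys would be stricter still, at the cost of listing each format's key by hand.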
   


