[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

GitBox Sun, 08 Jan 2023 10:22:42 -0800


rdblue commented on code in PR #6525:
URL: https://github.com/apache/iceberg/pull/6525#discussion_r1064183618



##########
python/pyiceberg/avro/resolver.py:
##########
@@ -109,38 +109,46 @@ def resolve(
 
 
 class SchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
-    read_types: Optional[Dict[int, Callable[[Schema], StructProtocol]]]
+    read_types: Dict[int, Type[StructProtocol]]
+    context: List[int]
 
-    def __init__(self, read_types: Optional[Dict[int, Callable[[Schema], 
StructProtocol]]]):
+    def __init__(self, read_types: Dict[int, Type[StructProtocol]] = 
EMPTY_DICT) -> None:
         self.read_types = read_types
+        self.context = []
 
     def schema(self, schema: Schema, expected_schema: Optional[IcebergType], 
result: Reader) -> Reader:
         return result
 
+    def before_field(self, field: NestedField, field_partner: 
Optional[NestedField]) -> None:
+        self.context.append(field.field_id)
+
+    def after_field(self, field: NestedField, field_partner: 
Optional[NestedField]) -> None:
+        self.context.pop()
+
     def struct(self, struct: StructType, expected_struct: 
Optional[IcebergType], field_readers: List[Reader]) -> Reader:
+        # -1 indicates the struct root
+        read_struct_id = self.context[-1] if len(self.context) > 0 else -1
+        struct_callable = self.read_types.get(read_struct_id, Record)

Review Comment:
   I just went to refer back to my later changes, but I see that 
[20b07d35](https://github.com/apache/iceberg/pull/6506/commits/20b07d355f2238580273b83212a6f25ea0d62f21)
 is no longer there. I pushed the changes here: 
https://github.com/apache/iceberg/commit/feef2b6a5d7d33f0818e9ed1e36b31c5442e050d
   
   This PR changes `read_types` by passing in just the type, but this needs to 
pass the read schema to the type. Without the read schema, the positions used 
by `set` won't be correct. For example, if the file contains int columns a, b, 
and c, but the read schema requested b and d, this will produce `((None, 
IntReader), (0, IntReader), (None, IntReader), (1, NoneReader))`. Then the 
struct instance will get calls `set(0, reader.read(decoder))` and `set(1, 
noneReader.read(decoder))`. So the instance needs to know what positions 0 and 
1 are.
   
   There are two solutions to the problem. The first is to pass the read schema 
to the class so it can map the fields, which is what I did in my PR. That's how 
`IndexedRecord` works in the JVM Avro implementation.
   
   The second solution is to use IDs rather than positions. If we did that, 
then some `setById` method on the struct instance would always know what to do. 
This would require a different `StructProtocol` or an extension to it, and I 
think we would still want to have a way to pass the struct type to generic 
`Record` instances... so I think for now it makes sense to go with the position 
approach and pass the schema to the struct instance when it is created.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a diff in pull request #6525: Python: Refactor loading manifests

Reply via email to