jschroed5 commented on issue #34766: URL: https://github.com/apache/beam/issues/34766#issuecomment-2836856224
I think the underlying issue is that the [Iceberg spec](https://iceberg.apache.org/spec/#identifier-field-ids) does not define how field IDs are numbered. It's left as an implementation detail, as long as field IDs are unique. An Iceberg catalog service doesn't have to respect the `field_id` assignments that the client provides in a create-table call; it can choose any assignment ordering it wants at table creation time _(I'd love to have someone from the Iceberg project chime in and confirm this)._

For example, Nessie's REST catalog assigns IDs depth-first at table creation time, based on [this code](https://github.com/projectnessie/nessie/blob/main/catalog/format/iceberg/src/main/java/org/projectnessie/catalog/formats/iceberg/nessie/NessieModelIceberg.java#L1298). _(I originally thought this was a `pyiceberg` problem because I was creating a Nessie REST table from `pyiceberg`, then trying to read it with Beam.)_

When reading an Iceberg table source, Beam currently discards the schema-provided field IDs in `IcebergIO.ReadRows` [here](https://github.com/apache/beam/blob/master/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergIO.java#L681) and re-creates them in breadth-first order from the Beam row `Schema` object in `ScanTaskReader` [here](https://github.com/apache/beam/blob/master/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/ScanTaskReader.java#L68). No matter which approach Beam takes to regenerate IDs (it was depth-first until [this PR](https://github.com/apache/beam/pull/32095) last year), it will fail to read some portion of Iceberg tables, because it shouldn't be regenerating IDs in the first place.

Instead, Beam should preserve the field ID assignments in the Iceberg `Schema` object it retrieves from the catalog service, rather than discarding them in `icebergSchemaToBeamSchema` and creating new ones in `beamSchemaToIcebergSchema`.
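To illustrate why regenerating IDs can't work in general, here is a minimal Python sketch of the two orderings applied to the same nested schema. It is a simplified model, not the actual Nessie or Beam code: the schema, the field names, and both `assign_*` helpers are hypothetical, and a struct is represented as a plain dict whose values are either a nested dict or `None` for a primitive field.

```python
from collections import deque

# Hypothetical nested schema: name -> nested struct (dict) or None (primitive).
SCHEMA = {
    "id": None,
    "address": {"street": None, "city": None},
    "name": None,
}


def assign_depth_first(schema):
    """Assign 1-based IDs in pre-order: a struct's children are numbered
    immediately after the struct itself, before its later siblings."""
    ids = {}

    def visit(struct, prefix):
        for name, child in struct.items():
            path = prefix + name
            ids[path] = len(ids) + 1
            if child is not None:
                visit(child, path + ".")

    visit(schema, "")
    return ids


def assign_breadth_first(schema):
    """Assign 1-based IDs level by level: all fields of a struct are
    numbered before any of their nested children."""
    ids = {}
    queue = deque([(schema, "")])
    while queue:
        struct, prefix = queue.popleft()
        for name, child in struct.items():
            path = prefix + name
            ids[path] = len(ids) + 1
            if child is not None:
                queue.append((child, path + "."))
    return ids


depth_first = assign_depth_first(SCHEMA)
breadth_first = assign_breadth_first(SCHEMA)

# Fields whose ID depends on which ordering was used.
mismatched = sorted(p for p in depth_first if depth_first[p] != breadth_first[p])
print(mismatched)
```

In this toy schema, `name` and both nested `address` fields end up with different IDs under the two orderings. A reader that regenerates IDs with one ordering would therefore look up the wrong columns in a table whose catalog used the other, which is why carrying the catalog's IDs through unchanged seems like the only robust option.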