jschroed5 commented on issue #34766:
URL: https://github.com/apache/beam/issues/34766#issuecomment-2836856224

   I think the underlying issue is that the 
[Iceberg spec](https://iceberg.apache.org/spec/#identifier-field-ids) does not 
define how field IDs are numbered. It's left as an implementation detail, as 
long as the field IDs are unique. An Iceberg catalog service doesn't have to 
respect the `field_id` assignments that the client provides in a create-table 
call; it can choose any assignment ordering it wants at table creation time 
_(I'd love to have someone from the Iceberg project chime in and confirm this)._
   
   For example, Nessie assigns IDs at table creation time in a depth-first 
manner for its REST catalog, based on [this 
code](https://github.com/projectnessie/nessie/blob/main/catalog/format/iceberg/src/main/java/org/projectnessie/catalog/formats/iceberg/nessie/NessieModelIceberg.java#L1298).
 _(I originally thought this was a `pyiceberg` problem because I was creating a 
Nessie REST table from `pyiceberg`, then trying to read it with Beam)._
   
   When reading an Iceberg table source, Beam is currently discarding the 
schema-provided field IDs in `IcebergIO.ReadRows` 
[here](https://github.com/apache/beam/blob/master/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergIO.java#L681),
 and re-creating them with breadth-first ordering from the Beam row `Schema` 
object in `ScanTaskReader` 
[here](https://github.com/apache/beam/blob/master/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/ScanTaskReader.java#L68).
 No matter which ordering Beam uses to regenerate IDs (it was depth-first 
until [this PR](https://github.com/apache/beam/pull/32095) last year), it will 
fail to read those Iceberg tables whose catalog numbered the fields in a 
different order, because it shouldn't be regenerating IDs in the first place.
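
   To illustrate why no single regeneration order can work, here is a minimal 
sketch that numbers the same nested schema depth-first and breadth-first. The 
toy schema, the function names, and the dotted path notation are all 
assumptions made for illustration; this is not the Nessie or Beam code.

```python
from collections import deque

# Toy nested schema (hypothetical): each field is (name, children),
# where children is a list for a struct and None for a primitive.
SCHEMA = [
    ("id", None),
    ("location", [("lat", None), ("lon", None)]),
    ("name", None),
]


def assign_depth_first(fields, next_id=1, prefix=""):
    """Number a field and all of its descendants before its next sibling."""
    ids = {}
    for name, children in fields:
        path = prefix + name
        ids[path] = next_id
        next_id += 1
        if children:
            child_ids, next_id = assign_depth_first(children, next_id, path + ".")
            ids.update(child_ids)
    return ids, next_id


def assign_breadth_first(fields):
    """Number every field at one nesting level before descending a level."""
    ids = {}
    next_id = 1
    queue = deque([("", fields)])
    while queue:
        prefix, level = queue.popleft()
        for name, children in level:
            path = prefix + name
            ids[path] = next_id
            next_id += 1
            if children:
                queue.append((path + ".", children))
    return ids


depth_first, _ = assign_depth_first(SCHEMA)
breadth_first = assign_breadth_first(SCHEMA)

# Fields whose ID differs between the two orderings.
mismatched = sorted(p for p in depth_first if depth_first[p] != breadth_first[p])
print(mismatched)
```

   In this toy case only `id` and `location` get the same number under both 
orderings, so a reader that looks up columns by regenerated ID would resolve 
`name`, `location.lat`, and `location.lon` to the wrong fields whenever the 
table was numbered with the other ordering.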
   
   Beam should instead preserve the field ID assignments it retrieves from the 
catalog service in the Iceberg `Schema` object, rather than discarding them in 
`icebergSchemaToBeamSchema` and creating new ones in 
`beamSchemaToIcebergSchema`.
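
   As a rough sketch of the idea, assuming the intermediate schema can carry 
the original ID alongside each field: `IcebergField`, `BeamField`, and the 
`iceberg_id` attribute below are hypothetical stand-ins, not the real Iceberg 
or Beam classes, and how Beam would actually carry the IDs is left open here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IcebergField:
    """Hypothetical stand-in for an Iceberg schema field."""
    field_id: int
    name: str
    children: tuple = ()


@dataclass(frozen=True)
class BeamField:
    """Hypothetical stand-in for a Beam schema field that remembers its Iceberg ID."""
    name: str
    children: tuple = ()
    iceberg_id: Optional[int] = None


def to_beam(f: IcebergField) -> BeamField:
    # Carry the catalog-assigned ID along instead of dropping it.
    return BeamField(f.name, tuple(to_beam(c) for c in f.children), f.field_id)


def to_iceberg(f: BeamField) -> IcebergField:
    # Reuse the carried ID; refuse to invent one if it is missing.
    if f.iceberg_id is None:
        raise ValueError(f"field {f.name!r} has no preserved Iceberg ID")
    return IcebergField(f.iceberg_id, f.name, tuple(to_iceberg(c) for c in f.children))


# A table whose catalog happened to number fields depth-first.
catalog_schema = (
    IcebergField(1, "id"),
    IcebergField(2, "location", (IcebergField(3, "lat"), IcebergField(4, "lon"))),
    IcebergField(5, "name"),
)

beam_schema = tuple(to_beam(f) for f in catalog_schema)
round_tripped = tuple(to_iceberg(f) for f in beam_schema)
```

   With the IDs carried through, the round trip returns exactly the schema 
the catalog reported, whatever ordering the catalog used to number it.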
   
   
   

