brunsgaard opened a new issue, #5875:
URL: https://github.com/apache/paimon/issues/5875

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Paimon version
   
   1.2.0
   
   ### Compute Engine
   
   Flink 1.20.1
   
   ### Minimal reproduce step
   
   I am inspecting the metadata and do not have the ability to reproduce the 
issue myself. I hope this report is still useful to the community.
   
   ### What doesn't meet your expectations?
   
   When writing Iceberg-compatible manifest files, the generated Avro schema 
currently omits field-id annotations in nested fields such as 
`null_value_counts`. This omission leads to compatibility issues with query 
engines and tools that expect fully compliant Iceberg metadata, such as 
BigQuery and PyIceberg.
   
   For example, the current schema for `null_value_counts`, taken from the 
Paimon-generated compatibility metadata file, is:
   
   ```json
   {
     "name": "null_value_counts",
     "type": [
       "null",
       {
         "type": "array",
         "items": {
           "type": "record",
           "name": "r2_null_value_counts",
           "fields": [
             { "name": "key", "type": "int" },
             { "name": "value", "type": "long" }
           ]
         },
         "logicalType": "map"
       }
     ],
     "default": null
   }
   ```
   
   However, the expected schema (as produced by PyIceberg) includes explicit 
field-id annotations:
   
   ```json
   {
     "name": "null_value_counts",
     "type": [
       "null",
       {
         "type": "array",
         "items": {
           "type": "record",
           "name": "k121_v122",
           "fields": [
             { "name": "key", "type": "int", "field-id": 121 },
             { "name": "value", "type": "long", "field-id": 122 }
           ]
         },
         "logicalType": "map"
       }
     ],
     "doc": "Map of column id to null value count",
     "default": null,
     "field-id": 110
   }
   ```
   
   It would be great if Paimon could update its manifest generation logic to 
include field-id metadata where required by the Iceberg spec. This small change 
would improve out-of-the-box interoperability with downstream engines.
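
To make the requested change concrete, here is a rough Python sketch of the 
transformation, not Paimon's actual code. The helper `annotate_map_field` and 
the `NULL_VALUE_COUNTS_IDS` table are hypothetical names I made up for 
illustration; the IDs 110, 121 and 122 are copied from the PyIceberg schema 
above. The record name is left unchanged.

```python
import copy
import json

# IDs copied from the PyIceberg-written schema for null_value_counts.
NULL_VALUE_COUNTS_IDS = {"field": 110, "key": 121, "value": 122}


def annotate_map_field(field, ids):
    """Return a copy of an Avro field that encodes a map as an array of
    key/value records, with "field-id" attributes added."""
    out = copy.deepcopy(field)  # leave the caller's schema untouched
    out["field-id"] = ids["field"]
    # The type is usually a union like ["null", {...}]; handle a bare type too.
    branches = out["type"] if isinstance(out["type"], list) else [out["type"]]
    for branch in branches:
        if isinstance(branch, dict) and branch.get("type") == "array":
            for sub in branch["items"]["fields"]:
                if sub["name"] in ("key", "value"):
                    sub["field-id"] = ids[sub["name"]]
    return out


# The field as Paimon currently writes it (from the first schema above).
paimon_field = {
    "name": "null_value_counts",
    "type": [
        "null",
        {
            "type": "array",
            "items": {
                "type": "record",
                "name": "r2_null_value_counts",
                "fields": [
                    {"name": "key", "type": "int"},
                    {"name": "value", "type": "long"},
                ],
            },
            "logicalType": "map",
        },
    ],
    "default": None,
}

fixed_field = annotate_map_field(paimon_field, NULL_VALUE_COUNTS_IDS)
print(json.dumps(fixed_field, indent=2))
```

In Paimon itself the IDs would presumably be attached where the manifest Avro 
schema is built rather than patched in afterwards; the sketch only shows which 
attributes end up where.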
   
   For more details, see https://iceberg.apache.org/spec/#avro:
   > Iceberg struct, list, and map types identify nested types by ID. When 
writing data to Avro files, these IDs must be stored in the Avro schema to 
support ID-based column pruning.
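
For anyone who wants to check their own metadata, below is a small Python 
sketch that walks an Avro schema (already parsed from JSON) and lists record 
fields without a `field-id` attribute. The function name `missing_field_ids` 
and the wrapper record `manifest_stub` are hypothetical, and the check only 
looks at record fields, which is what this report is about; it is not a full 
Iceberg spec validator.

```python
def missing_field_ids(schema, path=""):
    """Yield dotted paths of record fields that lack a "field-id" attribute."""
    if isinstance(schema, list):  # union: inspect every branch
        for branch in schema:
            yield from missing_field_ids(branch, path)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for field in schema.get("fields", []):
                fpath = f"{path}.{field['name']}" if path else field["name"]
                if "field-id" not in field:
                    yield fpath
                yield from missing_field_ids(field["type"], fpath)
        elif kind == "array":
            yield from missing_field_ids(schema["items"], path)
        elif kind == "map":
            yield from missing_field_ids(schema["values"], path)
        elif isinstance(kind, (dict, list)):  # nested type definition
            yield from missing_field_ids(kind, path)
    # Plain strings such as "int" or "null" have nothing nested to inspect.


# The Paimon-written field from this report, wrapped in a stub record so the
# walker has a record to start from.
paimon_schema = {
    "type": "record",
    "name": "manifest_stub",
    "fields": [
        {
            "name": "null_value_counts",
            "type": [
                "null",
                {
                    "type": "array",
                    "items": {
                        "type": "record",
                        "name": "r2_null_value_counts",
                        "fields": [
                            {"name": "key", "type": "int"},
                            {"name": "value", "type": "long"},
                        ],
                    },
                    "logicalType": "map",
                },
            ],
            "default": None,
        }
    ],
}

print(list(missing_field_ids(paimon_schema)))
# → ['null_value_counts', 'null_value_counts.key', 'null_value_counts.value']
```

Run against the PyIceberg-written version of the same field, the list should 
come back empty.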
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
