yshcz opened a new issue, #14926:
URL: https://github.com/apache/iceberg/issues/14926

   ### Feature Request / Improvement
   
   The current spec defines Avro file metadata requirements for manifest files 
in a clear table:
   
     | v1         | v2         | Key                 | Value                                                                        |
     |------------|------------|---------------------|------------------------------------------------------------------------------|
     | _required_ | _required_ | `schema`            | JSON representation of the table schema at the time the manifest was written |
     | _optional_ | _required_ | `schema-id`         | ID of the schema used to write the manifest as a string                      |
     | _required_ | _required_ | `partition-spec`    | JSON representation of the partition spec used to write the manifest         |
     | _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string              |
     | _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string                      |
     |            | _required_ | `content`           | Type of content files tracked by the manifest: "data" or "deletes"           |
   
   But manifest **list** files have no equivalent specification for their Avro 
metadata, despite the Java implementation writing metadata such as 
`format-version`, `snapshot-id`, `parent-snapshot-id`, and `sequence-number` to 
manifest list files since 2020.
   
   For manifests: #913 added `format-version` to code (2020-04), and #1499 
added the spec (2020-10).
   For manifest lists: #907 added `format-version` to code (2020-04), but there 
are no corresponding spec changes.
   
   As a result, implementations have no standard way to detect the format 
version from a manifest list file alone. They must either infer the version 
from the presence of certain fields or simply trust the table metadata 
version. The latter is unreliable in upgrade scenarios, where a v2 table may 
still contain v1 snapshots, and handling that case adds unnecessary complexity.
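
To illustrate what reading the version from the file alone could look like, here is a minimal sketch that parses only the header of an Avro object container file and looks up `format-version`. The function names (`read_avro_file_metadata`, `manifest_list_format_version`) are hypothetical, and the fallback to version 1 when the key is absent is an assumption that mirrors the "optional in v1" idea from this proposal, not something the spec currently states.

```python
AVRO_MAGIC = b"Obj\x01"  # magic bytes that start every Avro object container file


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def _read_bytes(stream):
    """Decode one length-prefixed Avro bytes/string value."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_file_metadata(stream):
    """Return the header metadata map (str -> bytes) of an Avro container file."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    return metadata


def manifest_list_format_version(metadata, default=1):
    """Hypothetical helper: use `format-version` if present, else `default`.

    Falling back to 1 is an assumption taken from this proposal.
    """
    raw = metadata.get("format-version")
    return int(raw.decode("utf-8")) if raw is not None else default
```

With a manifest list opened in binary mode, `manifest_list_format_version(read_avro_file_metadata(f))` would return the version without consulting the table metadata, provided the writer stored the key.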
   
   The following table might be a reasonable addition, though I'm not entirely 
certain about the requirements:
   
   | v1         | v2         | v3         | Key                   | Value                                                |
   |------------|------------|------------|-----------------------|------------------------------------------------------|
   | _required_ | _required_ | _required_ | `snapshot-id`         | The snapshot ID for this manifest list as a string   |
   | _required_ | _required_ | _required_ | `parent-snapshot-id`  | The parent snapshot ID as a string                   |
   |            | _required_ | _required_ | `sequence-number`     | The sequence number of the snapshot as a string      |
   |            |            | _required_ | `first-row-id`        | The first row ID for row lineage as a string         |
   | _optional_ | _required_ | _required_ | `format-version`      | Table format version number as a string              |
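
If requirements along these lines were adopted, a reader could check a manifest list's metadata keys against them. The sketch below simply transcribes the proposed table above into a lookup. The requirement levels are this proposal's tentative values rather than spec text, and `PROPOSED_KEYS` and `missing_required_keys` are hypothetical names.

```python
# Requirement levels transcribed from the proposed table; tentative, not spec text.
PROPOSED_KEYS = {
    "snapshot-id": {1: "required", 2: "required", 3: "required"},
    "parent-snapshot-id": {1: "required", 2: "required", 3: "required"},
    "sequence-number": {2: "required", 3: "required"},
    "first-row-id": {3: "required"},
    "format-version": {1: "optional", 2: "required", 3: "required"},
}


def missing_required_keys(metadata_keys, format_version):
    """Return the sorted proposed-required keys absent for `format_version`."""
    present = set(metadata_keys)
    return sorted(
        key
        for key, levels in PROPOSED_KEYS.items()
        if levels.get(format_version) == "required" and key not in present
    )
```

For example, a file carrying only `snapshot-id` and `parent-snapshot-id` would pass as v1 under this proposal but would be reported as missing `format-version` and `sequence-number` for v2.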
   
   
   ### Query engine
   
   None
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [x] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

