yshcz opened a new issue, #14926:
URL: https://github.com/apache/iceberg/issues/14926
### Feature Request / Improvement
The current spec defines Avro file metadata requirements for manifest files
in a clear table:

| v1         | v2         | Key                 | Value                                                                         |
|------------|------------|---------------------|-------------------------------------------------------------------------------|
| _required_ | _required_ | `schema`            | JSON representation of the table schema at the time the manifest was written |
| _optional_ | _required_ | `schema-id`         | ID of the schema used to write the manifest as a string                      |
| _required_ | _required_ | `partition-spec`    | JSON representation of the partition spec used to write the manifest         |
| _optional_ | _required_ | `partition-spec-id` | ID of the partition spec used to write the manifest as a string              |
| _optional_ | _required_ | `format-version`    | Table format version number of the manifest as a string                      |
|            | _required_ | `content`           | Type of content files tracked by the manifest: "data" or "deletes"           |

But manifest **list** files have no equivalent specification for their Avro
metadata, despite the Java implementation writing metadata such as
`format-version`, `snapshot-id`, `parent-snapshot-id`, and `sequence-number` to
manifest list files since 2020.
For manifests: #913 added `format-version` to the code (2020-04), and #1499
added it to the spec (2020-10).
For manifest lists: #907 added `format-version` to the code (2020-04), but no
corresponding spec change followed.
As a result, implementations have no standard way to detect the format
version from a manifest list file alone. They are forced to either infer the
version based on the presence of certain fields, or simply trust the table
metadata version. The latter is unreliable in upgrade scenarios where a v2
table may contain v1 snapshots, introducing unnecessary complexity.
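
To make the two detection strategies concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not how any Iceberg implementation does it: `detect_format_version` is a hypothetical helper, it expects the Avro file metadata already loaded into a mapping, and the field-based fallback assumes that `sequence_number` first appears in v2 manifest list entries and `first_row_id` in v3.

```python
from typing import Mapping, Set, Tuple, Union


def detect_format_version(
    avro_meta: Mapping[str, Union[bytes, str]],
    entry_field_names: Set[str],
) -> Tuple[int, str]:
    """Return (format_version, how_it_was_determined) for a manifest list.

    Hypothetical helper: `avro_meta` is the Avro file-level key/value
    metadata, `entry_field_names` the field names of the entry schema.
    """
    raw = avro_meta.get("format-version")
    if raw is not None:
        # Preferred path: the writer recorded the version explicitly.
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        return int(text), "metadata"

    # Fallback: infer from which fields are present (assumed mapping,
    # see the lead-in). This is the fragile path the issue describes.
    if "first_row_id" in entry_field_names:
        return 3, "inferred"
    if "sequence_number" in entry_field_names:
        return 2, "inferred"
    return 1, "inferred"
```

A reader that gets `"inferred"` back knows it is relying on a heuristic, which is exactly the situation a specified `format-version` key would avoid.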
The following table might be a reasonable addition, though I'm not entirely
certain about the requirements:

| v1         | v2         | v3         | Key                  | Value                                             |
|------------|------------|------------|----------------------|---------------------------------------------------|
| _required_ | _required_ | _required_ | `snapshot-id`        | The snapshot ID for this manifest list as a string |
| _required_ | _required_ | _required_ | `parent-snapshot-id` | The parent snapshot ID as a string                |
|            | _required_ | _required_ | `sequence-number`    | The sequence number of the snapshot as a string   |
|            |            | _required_ | `first-row-id`       | The first row ID for row lineage as a string      |
| _optional_ | _required_ | _required_ | `format-version`     | Table format version number as a string           |

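
If a table like this were adopted, a reader could check a manifest list against it along these lines. This is only a sketch: the required-key sets are copied from the proposal above (which is tentative, not part of the spec), and `PROPOSED_REQUIRED_KEYS` and `missing_keys` are hypothetical names.

```python
from typing import Dict, List, Mapping, Set

# Required keys per format version, taken from the proposed table above.
# This mirrors the proposal only; it is not an agreed part of the spec.
PROPOSED_REQUIRED_KEYS: Dict[int, Set[str]] = {
    1: {"snapshot-id", "parent-snapshot-id"},
    2: {"snapshot-id", "parent-snapshot-id", "sequence-number",
        "format-version"},
    3: {"snapshot-id", "parent-snapshot-id", "sequence-number",
        "first-row-id", "format-version"},
}


def missing_keys(avro_meta: Mapping[str, object], format_version: int) -> List[str]:
    """Return the proposed required keys absent from `avro_meta`, sorted."""
    required = PROPOSED_REQUIRED_KEYS[format_version]
    return sorted(required - set(avro_meta))
```

For example, a file carrying only `snapshot-id` and `parent-snapshot-id` would pass as v1 but would be reported as missing `format-version` and `sequence-number` if treated as v2.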
### Query engine
None
### Willingness to contribute
- [ ] I can contribute this improvement/feature independently
- [x] I would be willing to contribute this improvement/feature with
guidance from the Iceberg community
- [ ] I cannot contribute this improvement/feature at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]