rdblue commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r906344329
##########
format/spec.md:
##########
@@ -486,16 +486,17 @@ When reading v1 manifests with no sequence number column,
sequence numbers for a
A snapshot consists of the following fields:
-| v1 | v2 | Field | Description |
-| ---------- | ---------- | ------------------------ | ----------- |
-| _required_ | _required_ | **`snapshot-id`** | A unique long ID |
-| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the
snapshot's parent. Omitted for any snapshot with no parent |
-| | _required_ | **`sequence-number`** | A monotonically
increasing long that tracks the order of changes to a table |
-| _required_ | _required_ | **`timestamp-ms`** | A timestamp when the
snapshot was created, used for garbage collection and table inspection |
-| _optional_ | _required_ | **`manifest-list`** | The location of a
manifest list for this snapshot that tracks manifest files with additional
metadata |
-| _optional_ | | **`manifests`** | A list of manifest file
locations. Must be omitted if `manifest-list` is present |
-| _optional_ | _required_ | **`summary`** | A string map that
summarizes the snapshot changes, including `operation` (see below) |
-| _optional_ | _optional_ | **`schema-id`** | ID of the table's
current schema when the snapshot was created |
+| v1 | v2 | Field | Description
|
+| ---------- | ---------- | ------------------------
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`** | A unique long ID
|
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the
snapshot's parent. Omitted for any snapshot with no parent
|
+| | _required_ | **`sequence-number`** | A monotonically
increasing long that tracks the order of changes to a table
|
+| _required_ | _required_ | **`timestamp-ms`** | A timestamp when the
snapshot was created, used for garbage collection and table inspection
|
+| _optional_ | _required_ | **`manifest-list`** | The location of a
manifest list for this snapshot that tracks manifest files with additional
metadata |
+| _optional_ | | **`manifests`** | A list of manifest file
locations. Must be omitted if `manifest-list` is present
|
+| _optional_ | _required_ | **`summary`** | A string map that
summarizes the snapshot changes, including `operation` (see below)
|
+| _optional_ | _optional_ | **`schema-id`** | ID of the table's
current schema when the snapshot was created
|
+| _optional_ | _optional_ | **`statistics`** | A [statistics file's
metadata](#statistics-file). The field should be retained by writers, unless
writer updates the statistics, or knows they became obsolete. |
Review Comment:
After thinking about this more and talking with @danielcweeks, I think that
stats files should be tracked by a new top-level `stats` field in
`TableMetadata`.
The issues with tracking stats in a snapshot are:
1. There is now a rule about carrying forward stats files that all writers
must implement. This makes it difficult to manage an environment with multiple
writer versions. The stats files will be dropped by some writers, and there
isn't a good way to detect whether omitting stats was intentional (they are no
longer accurate) or accidental (the writer didn't know about stats).
2. Adding or updating stats creates a new snapshot. The new snapshot has no
data changes, but will cause retries and validations for concurrent writers.
A top-level `stats` field addresses the writer challenge because modifying
stats files is separate from adding a snapshot and updating a ref. The catalog
can detect the difference between removing a stats file (`stats` list exists)
and accidentally dropping stats (`stats` list is omitted).
This also allows us to add to the Table API, rather than adding to the
Snapshot API. The table API is better because matches are fuzzy -- there isn't
necessarily a stats file that was calculated for a given snapshot so we don't
want to associated a snapshot 1-to-1 with a stats file. (Side benefit: we don't
copy all the stats metadata into each snapshot!)
The main drawback is that this would need to contain more than one stats
file per branch in the list in order to use the stats file that was current as
of a snapshot. But there is a benefit to this as well because you can always
calculate old stats files and add them. For example, you can calculate stats
for all of the tags in a table after they are tagged, even if the tagged
snapshot itself didn't have stats originally.
Over all, I'm more and more confident that a top-level `stats` field is the
right way to go.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]