[GitHub] [iceberg] rdblue commented on a diff in pull request #4945: Add table spec changes for statistics information in table snapshot

GitBox Fri, 24 Jun 2022 12:25:11 -0700


rdblue commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r906344329



##########
format/spec.md:
##########
@@ -486,16 +486,17 @@ When reading v1 manifests with no sequence number column, 
sequence numbers for a
 
 A snapshot consists of the following fields:
 
-| v1         | v2         | Field                    | Description |
-| ---------- | ---------- | ------------------------ | ----------- |
-| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
-| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the 
snapshot's parent. Omitted for any snapshot with no parent |
-|            | _required_ | **`sequence-number`**    | A monotonically 
increasing long that tracks the order of changes to a table |
-| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the 
snapshot was created, used for garbage collection and table inspection |
-| _optional_ | _required_ | **`manifest-list`**      | The location of a 
manifest list for this snapshot that tracks manifest files with additional 
metadata |
-| _optional_ |            | **`manifests`**          | A list of manifest file 
locations. Must be omitted if `manifest-list` is present |
-| _optional_ | _required_ | **`summary`**            | A string map that 
summarizes the snapshot changes, including `operation` (see below) |
-| _optional_ | _optional_ | **`schema-id`**          | ID of the table's 
current schema when the snapshot was created |
+| v1         | v2         | Field                    | Description             
                                                                                
                                                        |
+| ---------- | ---------- | ------------------------ 
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**        | A unique long ID        
                                                                                
                                                        |
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the 
snapshot's parent. Omitted for any snapshot with no parent                      
                                                         |
+|            | _required_ | **`sequence-number`**    | A monotonically 
increasing long that tracks the order of changes to a table                     
                                                                |
+| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the 
snapshot was created, used for garbage collection and table inspection          
                                                           |
+| _optional_ | _required_ | **`manifest-list`**      | The location of a 
manifest list for this snapshot that tracks manifest files with additional 
metadata                                                           |
+| _optional_ |            | **`manifests`**          | A list of manifest file 
locations. Must be omitted if `manifest-list` is present                        
                                                        |
+| _optional_ | _required_ | **`summary`**            | A string map that 
summarizes the snapshot changes, including `operation` (see below)              
                                                              |
+| _optional_ | _optional_ | **`schema-id`**          | ID of the table's 
current schema when the snapshot was created                                    
                                                              |
+| _optional_ | _optional_ | **`statistics`**         | A [statistics file's 
metadata](#statistics-file). The field should be retained by writers, unless 
writer updates the statistics, or knows they became obsolete. |

Review Comment:
   After thinking about this more and talking with @danielcweeks, I think that 
stats files should be tracked by a new top-level `stats` field in 
`TableMetadata`.
   
   The issues with tracking stats in a snapshot are:
   1. There is now a rule about carrying forward stats files that all writers 
must implement. This makes it difficult to manage an environment with multiple 
writer versions. The stats files will be dropped by some writers, and there 
isn't a good way to detect whether omitting stats was intentional (they are no 
longer accurate) or accidental (the writer didn't know about stats).
   2. Adding or updating stats creates a new snapshot. The new snapshot has no 
data changes, but will cause retries and validations for concurrent writers.
   
   A top-level `stats` field addresses the writer challenge because modifying 
stats files is separate from adding a snapshot and updating a ref. The catalog 
can detect the difference between removing a stats file (`stats` list exists) 
and accidentally dropping stats (`stats` list is omitted).
   
   This also allows us to add to the Table API, rather than adding to the 
Snapshot API. The table API is better because matches are fuzzy -- there isn't 
necessarily a stats file that was calculated for a given snapshot so we don't 
want to associated a snapshot 1-to-1 with a stats file. (Side benefit: we don't 
copy all the stats metadata into each snapshot!)
   
   The main drawback is that this would need to contain more than one stats 
file per branch in the list in order to use the stats file that was current as 
of a snapshot. But there is a benefit to this as well because you can always 
calculate old stats files and add them. For example, you can calculate stats 
for all of the tags in a table after they are tagged, even if the tagged 
snapshot itself didn't have stats originally.
   
   Over all, I'm more and more confident that a top-level `stats` field is the 
right way to go.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a diff in pull request #4945: Add table spec changes for statistics information in table snapshot

Reply via email to