[GitHub] [iceberg] rdblue commented on a diff in pull request #4945: Add table spec changes for statistics information in table snapshot

GitBox Thu, 23 Jun 2022 14:29:45 -0700


rdblue commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r905485938



##########
format/spec.md:
##########
@@ -486,16 +486,17 @@ When reading v1 manifests with no sequence number column, sequence numbers for a
 
 A snapshot consists of the following fields:
 
-| v1         | v2         | Field                    | Description |
-| ---------- | ---------- | ------------------------ | ----------- |
-| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
-| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
-|            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
-| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
-| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
-| _optional_ |            | **`manifests`**          | A list of manifest file locations. Must be omitted if `manifest-list` is present |
-| _optional_ | _required_ | **`summary`**            | A string map that summarizes the snapshot changes, including `operation` (see below) |
-| _optional_ | _optional_ | **`schema-id`**          | ID of the table's current schema when the snapshot was created |
+| v1         | v2         | Field                    | Description |
+| ---------- | ---------- | ------------------------ |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
+|            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
+| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
+| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
+| _optional_ |            | **`manifests`**          | A list of manifest file locations. Must be omitted if `manifest-list` is present |
+| _optional_ | _required_ | **`summary`**            | A string map that summarizes the snapshot changes, including `operation` (see below) |
+| _optional_ | _optional_ | **`schema-id`**          | ID of the table's current schema when the snapshot was created |
+| _optional_ | _optional_ | **`statistics`**         | A [statistics file's metadata](#statistics-file). The field should be retained by writers, unless writer updates the statistics, or knows they became obsolete. |

Review Comment:
   This is a good point, but there are tradeoffs in both directions.
   
   If we attach a stats file at the table level, then we don't need to add logic for carrying it through from one snapshot to the next. That's nice because older writers could be unaware of stats and would not propagate the file. With snapshot-level stats, that means clients would need to look through commit history to find the most recent stats file attached to an ancestor snapshot. In short, there's some compatibility work to do here if we attach stats to snapshots. On the other hand, snapshot-level stats would give us stats at the branch level.
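A minimal sketch of the commit-history lookup described above, assuming snapshots are available as plain dicts keyed by `snapshot-id`. The helper name `find_latest_statistics`, the in-memory layout, and the sample `stats-path` values are hypothetical and only illustrate the idea; this is not Iceberg's API.

```python
from typing import Optional


def find_latest_statistics(snapshots_by_id: dict, snapshot_id: int) -> Optional[dict]:
    """Walk parent links from snapshot_id and return the nearest `statistics` entry.

    Hypothetical helper: a snapshot written by a stats-unaware writer has no
    `statistics` field, so the reader falls back to the closest ancestor that has one.
    """
    current = snapshots_by_id.get(snapshot_id)
    while current is not None:
        stats = current.get("statistics")
        if stats is not None:
            return stats
        parent_id = current.get("parent-snapshot-id")
        current = snapshots_by_id.get(parent_id) if parent_id is not None else None
    return None


# Illustrative data: only snapshot 1 carries stats; 2 and 3 came from older writers.
snapshots = {
    1: {"snapshot-id": 1, "statistics": {"stats-path": "stats-1.bin"}},
    2: {"snapshot-id": 2, "parent-snapshot-id": 1},
    3: {"snapshot-id": 3, "parent-snapshot-id": 2},
}

result = find_latest_statistics(snapshots, 3)
```

Stats found this way may describe an older snapshot than the one being read, so a reader would still need to decide whether they are fresh enough to use.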
   
   We could design a new table-level structure that tracks stats files per branch, something like this:
   
   ```json
   "stats": [
       { "ref": "main", "last-updated-snapshot-id": ..., "stats-path": "...", "file-size-in-bytes": ..., "footer-size-in-bytes": ..., "blob-metadata": [ ... ] },
       { "ref": "test", "last-updated-snapshot-id": ..., "stats-path": "...", "file-size-in-bytes": ..., "footer-size-in-bytes": ..., "blob-metadata": [ ... ] }
   ]
   ```
   
   I would probably prefer that new structure, or something similar. What do you think, @findepi and @flyrain?
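A rough sketch of how a reader might use the proposed per-branch structure, assuming the table metadata has been parsed into a dict with a top-level `stats` list as in the JSON above. The function name `stats_for_ref` and the sample values are hypothetical; the proposal itself leaves them elided.

```python
from typing import Optional

# Illustrative metadata: field names follow the proposal, values are made up.
table_metadata = {
    "stats": [
        {"ref": "main", "last-updated-snapshot-id": 3, "stats-path": "stats-main.bin"},
        {"ref": "test", "last-updated-snapshot-id": 5, "stats-path": "stats-test.bin"},
    ]
}


def stats_for_ref(metadata: dict, ref: str) -> Optional[dict]:
    """Return the stats entry tracked for a branch name, or None if there is none."""
    for entry in metadata.get("stats", []):
        if entry.get("ref") == ref:
            return entry
    return None


main_stats = stats_for_ref(table_metadata, "main")
```

With this layout the lookup is a single scan of table metadata and no history walk is needed, which is why writers that are unaware of stats would not break it.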



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

