[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4945: Add table spec changes for statistics information in table snapshot

GitBox Tue, 28 Jun 2022 07:57:11 -0700


RussellSpitzer commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r908589536



##########
format/spec.md:
##########
@@ -486,16 +486,17 @@ When reading v1 manifests with no sequence number column, sequence numbers for a
 
 A snapshot consists of the following fields:
 
-| v1         | v2         | Field                    | Description |
-| ---------- | ---------- | ------------------------ | ----------- |
-| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
-| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
-|            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
-| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
-| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
-| _optional_ |            | **`manifests`**          | A list of manifest file locations. Must be omitted if `manifest-list` is present |
-| _optional_ | _required_ | **`summary`**            | A string map that summarizes the snapshot changes, including `operation` (see below) |
-| _optional_ | _optional_ | **`schema-id`**          | ID of the table's current schema when the snapshot was created |
+| v1         | v2         | Field                    | Description |
+| ---------- | ---------- | ------------------------ |-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
+|            | _required_ | **`sequence-number`**    | A monotonically increasing long that tracks the order of changes to a table |
+| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the snapshot was created, used for garbage collection and table inspection |
+| _optional_ | _required_ | **`manifest-list`**      | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
+| _optional_ |            | **`manifests`**          | A list of manifest file locations. Must be omitted if `manifest-list` is present |
+| _optional_ | _required_ | **`summary`**            | A string map that summarizes the snapshot changes, including `operation` (see below) |
+| _optional_ | _optional_ | **`schema-id`**          | ID of the table's current schema when the snapshot was created |
+| _optional_ | _optional_ | **`statistics`**         | A [statistics file's metadata](#statistics-file). The field should be retained by writers, unless writer updates the statistics, or knows they became obsolete. |

Review Comment:
   I don't think Snapshots is the right place to hold them, @rdblue. I'm just 
not sure how we would traverse ancestors in this case. Are you suggesting we 
have a list of stats files, each of which is associated with a single snapshot 
file, and that we then traverse our ancestors in the snapshot tree to find 
stats files?
   
   When we don't have a stats file for the current snapshot (or the snapshot 
in question), do we return no stats files? I think that would make sense; 
instead, we would rely on a metadata-only operation to associate old stats 
files with newer snapshots.
   
   For example, an older or non-Puffin-capable writer adds a snapshot B to an 
existing table with snapshot A:
   
   ```
   Table : { snaps {A, B} , stats { A -> A' } }
   ```
   
   When reading B, I think we have two options:
   
   1. Report there are no stats files associated with B
   2. Use the ancestor, find A -> A', and then attempt to determine whether 
the operation that made B invalidated A'
   
   I think 1. is probably safest. We would then add an operation for a client 
that understands A', `propagate_stats` or something, which would add a new 
metadata.json:
   ```
   Table : { snaps {A, B} , stats { A -> A', B -> A' } }
   ```
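
   A rough Python sketch of the two read options and the propagation step, 
assuming a toy in-memory model. `TableMetadata`, `stats_for` and 
`nearest_ancestor_stats` are hypothetical names made up for illustration, not 
Iceberg APIs, and `propagate_stats` is only the placeholder name used above. 
Option 2 is shown only as far as locating the ancestor's stats; deciding 
whether B invalidated A' is left out.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TableMetadata:
    """Toy stand-in for a metadata.json (hypothetical, not Iceberg's layout)."""
    parents: Dict[str, Optional[str]]  # snapshot id -> parent snapshot id
    stats: Dict[str, str] = field(default_factory=dict)  # snapshot id -> stats file


def stats_for(table: TableMetadata, snapshot_id: str) -> Optional[str]:
    """Option 1: report only stats explicitly associated with this snapshot."""
    return table.stats.get(snapshot_id)


def nearest_ancestor_stats(
    table: TableMetadata, snapshot_id: str
) -> Optional[Tuple[str, str]]:
    """Option 2, lookup half: walk parents until some snapshot has stats.

    The caller would still have to decide whether the stats are valid.
    """
    current: Optional[str] = snapshot_id
    while current is not None:
        if current in table.stats:
            return current, table.stats[current]
        current = table.parents.get(current)
    return None


def propagate_stats(table: TableMetadata, source_id: str, target_id: str) -> TableMetadata:
    """Metadata-only operation: return new metadata where target reuses source's stats."""
    if source_id not in table.stats:
        raise KeyError(f"no stats file associated with snapshot {source_id}")
    new_stats = {**table.stats, target_id: table.stats[source_id]}
    return replace(table, stats=new_stats)


# Table : { snaps {A, B} , stats { A -> A' } }
table = TableMetadata(parents={"A": None, "B": "A"}, stats={"A": "A'"})
updated = propagate_stats(table, "A", "B")
```

   Here `propagate_stats` returns a new object rather than mutating the old 
one, mirroring the idea that the operation writes a new metadata.json.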



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

