[GitHub] [iceberg] rdblue commented on a diff in pull request #4945: Add table spec changes for statistics information in table snapshot

GitBox Tue, 28 Jun 2022 07:47:27 -0700


rdblue commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r908577538



##########
format/spec.md:
##########
@@ -486,16 +486,17 @@ When reading v1 manifests with no sequence number column, 
sequence numbers for a
 
 A snapshot consists of the following fields:
 
-| v1         | v2         | Field                    | Description |
-| ---------- | ---------- | ------------------------ | ----------- |
-| _required_ | _required_ | **`snapshot-id`**        | A unique long ID |
-| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the 
snapshot's parent. Omitted for any snapshot with no parent |
-|            | _required_ | **`sequence-number`**    | A monotonically 
increasing long that tracks the order of changes to a table |
-| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the 
snapshot was created, used for garbage collection and table inspection |
-| _optional_ | _required_ | **`manifest-list`**      | The location of a 
manifest list for this snapshot that tracks manifest files with additional 
metadata |
-| _optional_ |            | **`manifests`**          | A list of manifest file 
locations. Must be omitted if `manifest-list` is present |
-| _optional_ | _required_ | **`summary`**            | A string map that 
summarizes the snapshot changes, including `operation` (see below) |
-| _optional_ | _optional_ | **`schema-id`**          | ID of the table's 
current schema when the snapshot was created |
+| v1         | v2         | Field                    | Description             
                                                                                
                                                        |
+| ---------- | ---------- | ------------------------ 
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`**        | A unique long ID        
                                                                                
                                                        |
+| _optional_ | _optional_ | **`parent-snapshot-id`** | The snapshot ID of the 
snapshot's parent. Omitted for any snapshot with no parent                      
                                                         |
+|            | _required_ | **`sequence-number`**    | A monotonically 
increasing long that tracks the order of changes to a table                     
                                                                |
+| _required_ | _required_ | **`timestamp-ms`**       | A timestamp when the 
snapshot was created, used for garbage collection and table inspection          
                                                           |
+| _optional_ | _required_ | **`manifest-list`**      | The location of a 
manifest list for this snapshot that tracks manifest files with additional 
metadata                                                           |
+| _optional_ |            | **`manifests`**          | A list of manifest file 
locations. Must be omitted if `manifest-list` is present                        
                                                        |
+| _optional_ | _required_ | **`summary`**            | A string map that 
summarizes the snapshot changes, including `operation` (see below)              
                                                              |
+| _optional_ | _optional_ | **`schema-id`**          | ID of the table's 
current schema when the snapshot was created                                    
                                                              |
+| _optional_ | _optional_ | **`statistics`**         | A [statistics file's 
metadata](#statistics-file). The field should be retained by writers, unless 
writer updates the statistics, or knows they became obsolete. |

Review Comment:
   @findepi, you would find the stats file written for the closest snapshot. 
That may be a snapshot that was written after: if a compaction runs and then 
stats are calculated for the compacted snapshot it would be closest for the 
snapshot that was compacted.
   
   The file that should be updated is the newest ancestor with stats in a given 
snapshot's history. There are utils for traversing ancestors, so you'd do that 
until you find an ancestor with a stats file.
   
   For the storage, I think we should use a list. Table metadata can be indexed 
however is convenient in memory. Adding requirements for ordering within the 
format is difficult (what happens when one is out of order?) and leads to 
issues with time skew.
   
   @RussellSpitzer, I think it is simple to traverse ancestors to find the last 
stats file. And we can add stats files for older snapshots later if needed (the 
tagging case). I think it is actually much cleaner by not keeping these in 
snapshots.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a diff in pull request #4945: Add table spec changes for statistics information in table snapshot

Reply via email to