[GitHub] [iceberg] rdblue commented on a diff in pull request #4945: Add statistics information in table snapshot

GitBox Mon, 06 Jun 2022 11:29:55 -0700


rdblue commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r890422795



##########
format/spec.md:
##########
@@ -513,6 +514,17 @@ Manifests for a snapshot are tracked by a manifest list.
 
 Valid snapshots are stored as a list in table metadata. For serialization, see Appendix C.
 
+Statistics files' metadata within `statistics` field is a struct with the following fields:
+
+| Field name                      | Type                               | Description |
+|---------------------------------|------------------------------------|-------------|
+| **`location`**                  | `string`                           | Location of the statistics file. See [Puffin file format](../puffin). |
+| **`file-size-in-bytes`**        | `long`                             | Size of the statistics file. |
+| **`file-footer-size-in-bytes`** | `long`                             | Size of the statistics file's footer. See [Puffin file format](../puffin) for footer definition. |
+| **`source-sequence-number`**    | `long`                             | Table sequence number at which the stats were calculated |
+| **`statistics-fields-sets`**    | `map<string, list<list<integer>>>` | A map indicating which statistics are contained in the statistics file and on which columns they were calculated. The map keys are statistics sketch names and map values represent sets of columns, given by column ID. |

Review Comment:
   I think this should include more information and be a bit easier to
   understand. Because this uses a map, the values also have to be a list of
   lists of integers in case more than one set of columns is covered under
   the same blob name. Instead, I think it is better to make the metadata a
   list of objects. That has a couple of advantages:
   * There can be one object per field set rather than combining field sets
     by blob name
   * The objects are extensible and can carry more information, like the NDV
     estimate (so a reader can avoid reading the file at all)
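
To illustrate the difference between the two shapes, here is a minimal sketch in Python. It is only an illustration of the idea in the comment above, not the format the PR or the spec defines: the sketch names, the column IDs, the object keys `type`, `fields`, and `ndv`, and the helper `to_blob_entries` are all hypothetical.

```python
# Hypothetical illustration only: sketch names, column IDs, and keys are made up.

# Shape in the diff: map<string, list<list<integer>>>
statistics_fields_sets = {
    "ndv-sketch": [[1], [2, 3]],  # two column sets share one sketch name
    "histogram": [[1]],
}


def to_blob_entries(fields_sets):
    """Flatten the map into one object per (sketch name, column set) pair."""
    return [
        {"type": sketch, "fields": list(columns)}
        for sketch, column_sets in fields_sets.items()
        for columns in column_sets
    ]


# Proposed shape: a list of objects, one per field set.
blob_entries = to_blob_entries(statistics_fields_sets)

# Because each entry is an object, it can later carry extra information,
# such as an NDV estimate (the value here is invented).
blob_entries[0]["ndv"] = 1000
```

With the list-of-objects shape, each field set stands alone and new keys can be added per entry without changing the type of the whole field.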



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

