[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

via GitHub Wed, 06 Sep 2023 07:51:00 -0700


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1317408295



##########
format/spec.md:
##########
@@ -702,6 +703,41 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file). Partition statistics are informational. A reader can choose to
+ignore partition statistics information. Partition statistics support is not required to read the table correctly.
+Each table snapshot may be associated with at most one partition statistic file and the table can contain many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |

Review Comment:
   I thought the sequence number is used as a quick alternative to a snapshot ID 
checkpoint when deciding whether or not to apply deletes. 
   
   The same sequence number info is also stored for Puffin files, along with the 
snapshot ID: https://iceberg.apache.org/spec/#table-statistics
   
   But I guess your point is that we can fetch this info from the snapshot at any 
time, so why store it again? 
   
   @rdblue suggested this. I think he can add more info. 
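
To make the question above concrete, here is a rough sketch of looking up the sequence number through `snapshot-id` instead of reading a stored copy. It assumes a trimmed-down metadata dict; the values, the file path, and the helper name `sequence_number_for` are hypothetical, and it also assumes the stored `max-data-sequence-number` matches the snapshot's `sequence-number`, which the PR text does not confirm.

```python
from typing import Optional

# Hypothetical, trimmed-down table metadata. Field names follow the spec text
# quoted above; the values and the file path are made up for illustration.
table_metadata = {
    "snapshots": [
        {"snapshot-id": 101, "sequence-number": 4},
        {"snapshot-id": 102, "sequence-number": 5},
    ],
    "partition-statistics": [
        {
            "snapshot-id": 102,
            "statistics-file-path": "s3://bucket/metadata/partition-stats-102",
            "max-data-sequence-number": 5,
        }
    ],
}


def sequence_number_for(metadata: dict, snapshot_id: int) -> Optional[int]:
    """Return the sequence number of the given snapshot, or None if it is absent."""
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == snapshot_id:
            return snapshot["sequence-number"]
    return None


stats = table_metadata["partition-statistics"][0]

# Option 1: read the copy stored next to the statistics file entry.
stored = stats["max-data-sequence-number"]

# Option 2: derive it from the snapshot list via snapshot-id
# (assumption: both values agree for the same snapshot).
derived = sequence_number_for(table_metadata, stats["snapshot-id"])
```

The stored copy avoids the scan over `snapshots`, at the cost of keeping the same value in two places.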



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

