szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1319945486
##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file).
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of the partition statistics file. See [Partition Statistics file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum data sequence number of the Iceberg table's snapshot the partition statistics was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in the default data file format of the table (for example, Parquet or ORC).
+These rows are sorted (in ascending manner with NULL FIRST) based on all partition columns from `partition` in the same order
Review Comment:
Nit: can we simplify this to just
`These rows must be sorted (in ascending manner with NULL FIRST) by partition to optimize...`?
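
For reference, here is a minimal sketch of what "ascending with NULL FIRST, by partition" could mean for partition tuples. It assumes each row is a plain Python dict with a hypothetical `partition` tuple field; `nulls_first_key` and `sort_partition_rows` are illustrative names, not part of any Iceberg API.

```python
from typing import Any, Dict, List, Optional, Tuple


def nulls_first_key(partition: Tuple[Optional[Any], ...]) -> Tuple[Tuple[int, Any], ...]:
    """Build a sort key in which a null field sorts before any non-null value."""
    # A null field becomes (0, 0) and a non-null field becomes (1, value).
    # The leading flag decides the order whenever exactly one side is null,
    # so the placeholder 0 is never compared against a real value.
    return tuple((0, 0) if value is None else (1, value) for value in partition)


def sort_partition_rows(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort rows ascending by their partition tuple, field by field, nulls first."""
    return sorted(rows, key=lambda row: nulls_first_key(row["partition"]))


# Hypothetical rows: a two-field partition tuple plus one made-up statistic.
rows = [
    {"partition": ("b", 1), "record_count": 10},
    {"partition": (None, 2), "record_count": 4},
    {"partition": ("a", None), "record_count": 7},
    {"partition": ("a", 1), "record_count": 3},
]
sorted_rows = sort_partition_rows(rows)
```

The sketch only works when the values in a given partition field are mutually comparable, which holds for a single typed partition column.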
##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file).
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
+it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table metadata field is a struct with the following fields:
Review Comment:
Nit: this sentence does not make much sense to me. Would this suffice?
``Partition statistics files contain a struct `partition-statistics` with the following fields``
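
To make the shape concrete, here is a rough sketch of one `partition-statistics` entry carrying the three required fields from the proposed spec text (`snapshot-id`, `statistics-file-path`, `max-data-sequence-number`). The class name, the helper, the sample values, and the path are all made up for illustration. Representing `partition-statistics` as a list of such structs is also an assumption on my part, not something the quoted text states.

```python
import json
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class PartitionStatisticsFile:
    """Hypothetical in-memory form of one partition statistics file reference."""

    snapshot_id: int
    statistics_file_path: str
    max_data_sequence_number: int

    def to_metadata(self) -> Dict[str, Union[int, str]]:
        # Field names follow the table in the proposed spec text.
        return {
            "snapshot-id": self.snapshot_id,
            "statistics-file-path": self.statistics_file_path,
            "max-data-sequence-number": self.max_data_sequence_number,
        }


# All values below are placeholders, not taken from a real table.
entry = PartitionStatisticsFile(
    snapshot_id=3055729675574597004,
    statistics_file_path="s3://example-bucket/db/tbl/metadata/partition-stats-example.parquet",
    max_data_sequence_number=42,
)

# Assumed layout: a list of entries under the `partition-statistics` field.
metadata_fragment = {"partition-statistics": [entry.to_metadata()]}
print(json.dumps(metadata_fragment, indent=2))
```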
##########
format/spec.md:
##########
@@ -702,6 +703,49 @@ Blob metadata is a struct with the following fields:
| _optional_ | _optional_ | **`properties`** | `map<string, string>` | Additional properties associated with the statistic. Subset of Blob properties in the Puffin file. |
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file spec](#partition-statistics-file).
+Partition statistics are not required for reading or planning and readers may ignore them.
+Each table snapshot may be associated with at most one partition statistic file.
+A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
Review Comment:
Nit: I am not sure these two sentences add much value. Isn't this the case for any file reference in Iceberg?
```
A writer can optionally write the partition statistics file during each write operation. If the statistics file is written for the specific snapshot,
it must be registered in the table metadata file to be considered as a valid statistics file for the reader.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.