[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

via GitHub Tue, 05 Sep 2023 21:37:06 -0700


szehon-ho commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1316589084



##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). Partition statistics are informational. A 
reader can choose to
+ignore partition statistics information. Partition statistics support is not 
required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write 
operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be 
considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table 
metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum 
data sequence number of the Iceberg table's snapshot the partition statistics 
was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the 
**table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first 
partition column from `partition`
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring 
all partition tuples are present.

Review Comment:
   Not sure I understand this sentence, did we mean: 'ensuring all partition 
tuples are present.' => 'ensuring that statistics for all partitions are 
present'?



##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). Partition statistics are informational. A 
reader can choose to
+ignore partition statistics information. Partition statistics support is not 
required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write 
operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be 
considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table 
metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum 
data sequence number of the Iceberg table's snapshot the partition statistics 
was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the 
**table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first 
partition column from `partition`

Review Comment:
   Did we consider sorting by entire partition struct?  (Why sort only by first 
column?)



##########
format/spec.md:
##########
@@ -702,6 +703,51 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). Partition statistics are informational. A 
reader can choose to
+ignore partition statistics information. Partition statistics support is not 
required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write 
operation. If the statistics file is written for the specific snapshot, 
+it must be accurate and must be registered in the table metadata file to be 
considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table 
metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum 
data sequence number of the Iceberg table's snapshot the partition statistics 
was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the 
**table default format**. 
+These rows are sorted based on the first partition column from `partition`. 
+Each unique partition tuple must have exactly one corresponding row, ensuring 
all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following 
fields:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data 
tuple, schema based on the partition spec output using partition field ids for 
the struct field ids |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of 
records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data 
files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | 
Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | 
Count of records in position delete files |

Review Comment:
   Can we keep existing fields to match partition table, and then add a 
record_count field that is the precomputed accurate value?



##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). Partition statistics are informational. A 
reader can choose to
+ignore partition statistics information. Partition statistics support is not 
required to read the table correctly. A table can contain
+many partition statistics files associated with different table snapshots.
+A writer can optionally write the partition statistics file during each write 
operation. If the statistics file is written for the specific snapshot,
+it must be accurate and must be registered in the table metadata file to be 
considered as a valid statistics file for the reader.
+
+Partition statistics files metadata within `partition-statistics` table 
metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+| _required_ | _required_ | **`max-data-sequence-number`** | `long` | Maximum 
data sequence number of the Iceberg table's snapshot the partition statistics 
was computed from. |
+
+#### Partition Statistics file
+
+Statistics information for every partition tuple is stored as a row in the 
**table default format**.
+These rows are sorted (in ascending manner with NULL FIRST) based on the first 
partition column from `partition`
+to optimize filtering rows while scanning.
+Each unique partition tuple must have exactly one corresponding row, ensuring 
all partition tuples are present.
+
+Partition statistics file store the statistics as a struct with the following 
fields:

Review Comment:
   Grammar: "Partition statistics files stores statistics" or "A partition 
statistics file stores statistics..."



##########
format/spec.md:
##########
@@ -702,6 +703,45 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). Partition statistics are informational. A 
reader can choose to
+ignore partition statistics information. Partition statistics support is not 
required to read the table correctly. A table can contain

Review Comment:
   If we add it:  `Each table snapshot may be associated with at most one 
partition statistic file`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7105: Spec: Add partition stats spec

Reply via email to