[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

via GitHub Wed, 13 Sep 2023 00:38:35 -0700


ajantha-bhat commented on code in PR #7105:
URL: https://github.com/apache/iceberg/pull/7105#discussion_r1324102467



##########
format/spec.md:
##########
@@ -702,6 +703,47 @@ Blob metadata is a struct with the following fields:
 | _optional_ | _optional_ | **`properties`** | `map<string, string>` | 
Additional properties associated with the statistic. Subset of Blob properties 
in the Puffin file. |
 
 
+#### Partition statistics
+
+Partition statistics files are based on [Partition Statistics file 
spec](#partition-statistics-file). 
+Partition statistics are not required for reading or planning and readers may 
ignore them.
+Each table snapshot may be associated with at most one partition statistic 
file.
+A writer can optionally write the partition statistics file during each write 
operation, and 
+it must be registered in the table metadata file to be considered as a valid 
statistics file for the reader.
+
+`partition-statistics` field of table metadata is an optional list of struct 
with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|----|----|------------|------|-------------|
+| _required_ | _required_ | **`snapshot-id`** | `long` | ID of the Iceberg 
table's snapshot the partition statistics file is associated with. |
+| _required_ | _required_ | **`statistics-file-path`** | `string` | Path of 
the partition statistics file. See [Partition Statistics 
file](#partition-statistics-file). |
+
+#### Partition Statistics file
+
+Statistics information for each unique partition tuple is stored as a row in 
the default data file format of the table (for example, Parquet or ORC).
+These rows must be sorted (in ascending manner with NULL FIRST) by `partition` 
field to optimize filtering rows while scanning.
+
+The schema of the partition statistics file is as follows:
+
+| v1 | v2 | Field id, name | Type | Description |
+|----|----|----------------|------|-------------|
+| _required_ | _required_ | **`1 partition`** | `struct<..>` | Partition data 
tuple, schema based on the unified partition type considering all specs in a 
table |
+| _required_ | _required_ | **`2 spec_id`** | `int` | Partition spec id |
+| _required_ | _required_ | **`3 data_record_count`** | `long` | Count of 
records in data files |
+| _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data 
files |
+| _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | 
Total size of data files in bytes |
+| _optional_ | _optional_ | **`6 position_delete_record_count`** | `long` | 
Count of records in position delete files |
+| _optional_ | _optional_ | **`7 position_delete_file_count`** | `int` | Count 
of position delete files |
+| _optional_ | _optional_ | **`8 equality_delete_record_count`** | `long` | 
Count of records in equality delete files |
+| _optional_ | _optional_ | **`9 equality_delete_file_count`** | `int` | Count 
of equality delete files |
+| _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate 
count of records in a partition after applying the delete files if any |
+| _optional_ | _optional_ | **`11 last_updated_at`** | `long` | Timestamp in 
milliseconds from the unix epoch when the partition was last updated |
+| _optional_ | _optional_ | **`12 last_updated_snapshot_id`** | `long` | ID of 
snapshot that last updated this partition |
+
+Note that partition data tuple's schema is based on the partition spec output 
using partition field ids for the struct field ids.
+The unified partition type is a struct containing all fields that have ever 
been a part of any spec in the table. 

Review Comment:
   Elaborated the description and added the examples. 
   
   During implementation I will be using existing 
`Partitioning#partitionType()` for this. 
   But Spec should not talk about code. Hence, excluded that info. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #7105: Spec: Add partition stats spec

Reply via email to