rdblue commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r929209107
##########
format/spec.md:
##########
@@ -665,9 +665,34 @@ Table metadata consists of the following fields:
| _optional_ | _required_ | **`sort-orders`**| A list of sort orders, stored as full sort order objects. |
| _optional_ | _required_ | **`default-sort-order-id`**| Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
| | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |
+| _optional_ | _optional_ | **`snapshot-statistics`** | A list (optional) of [table statistics](#table-statistics). |
For serialization details, see Appendix C.
+#### Table statistics
+
+Table statistics files are valid [Puffin files](../puffin-spec). Statistics are informational. A reader can choose to
+ignore statistics information. Statistics support is not required to read the table correctly. A table can contain
+many statistics files associated with different table snapshots.
+
+Statistics files metadata within `snapshot-statistics` table metadata field is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|------------|------------|---------------------------------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`snapshot-id`** | `string` | ID of the Iceberg table's snapshot the statistics were computed from. |
+| _required_ | _required_ | **`statistics-path`** | `string` | Path of the statistics file. See [Puffin file format](../puffin-spec). |
+| _required_ | _required_ | **`file-size-in-bytes`** | `long` | Size of the statistics file. |
+| _required_ | _required_ | **`file-footer-size-in-bytes`** | `long` | Total size of the statistics file's footer (not the footer payload size). See [Puffin file format](../puffin-spec) for footer definition. |
+| _required_ | _required_ | **`blob-metadata`** | `list<blob metadata>` (see below) | A list of the blob metadata for statistics contained in the file with structure described below. |
+
+Blob metadata is a struct with the following fields:
+
+| v1 | v2 | Field name | Type | Description |
+|------------|------------|------------------|-----------------------|----------------------------------------------------------------------------------------------------|
+| _required_ | _required_ | **`type`** | `string` | Type of the blob. Matches Blob type in the Puffin file. |
+| _required_ | _required_ | **`fields`** | `list<integer>` | Ordered list of fields, given by field ID, on which the statistic was calculated. |
Review Comment:
Right now, blob metadata in Puffin files includes the snapshot ID and sequence
number. I think those should be included here as well, until we decide to remove
them from blob metadata in Puffin files. The two definitions should mostly match.
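
To make the suggestion concrete, here is a rough sketch of the blob metadata struct with the two extra fields, plus a check that the copy in table metadata agrees with the copy in the Puffin footer. The key names `snapshot-id` and `sequence-number`, their integer types, the `BlobMetadata` class, and the `mismatches` helper are all assumptions for illustration, not wording proposed for the spec.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlobMetadata:
    """Hypothetical blob metadata entry as it might appear in table metadata."""
    type: str               # blob type, matching the blob type in the Puffin file
    snapshot_id: int        # assumed extra field, mirroring Puffin blob metadata
    sequence_number: int    # assumed extra field, mirroring Puffin blob metadata
    fields: List[int]       # ordered field IDs the statistic was calculated on

    def to_json_dict(self) -> dict:
        # Key names are assumed; they follow the hyphenated style used in the spec.
        return {
            "type": self.type,
            "snapshot-id": self.snapshot_id,
            "sequence-number": self.sequence_number,
            "fields": self.fields,
        }


def mismatches(table_entry: dict, puffin_entry: dict) -> List[str]:
    """Return the shared keys whose values differ between the two copies."""
    keys = ("type", "snapshot-id", "sequence-number", "fields")
    return [k for k in keys if table_entry.get(k) != puffin_entry.get(k)]
```

With both copies carrying the same four keys, `mismatches(table_entry, puffin_entry)` returning an empty list would be a cheap way to confirm that the two "mostly match".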
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]