Re: [PR] Spec: Add content stats to spec [iceberg]

via GitHub Fri, 15 May 2026 13:47:40 -0700


stevenzwu commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3250801661



##########
format/spec.md:
##########
@@ -704,11 +727,133 @@ Examples of valid field paths using normalized JSON path 
format are:
 * `$['tags']` -- the `tags` array
 * `$['addresses']['zip']` -- the `zip` field in an `addresses` array that 
contains objects
 
-For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are 
both points of the following coordinates X, Y, Z, and M (see Appendix G) which 
are the lower / upper bound of all objects in the file.
+##### Content Stats
 
-For `geography` only, xmin (X value of `lower_bounds`) may be greater than 
xmax (X value of `upper_bounds`), in which case an object in this bounding box 
may match if it contains an X such that x >= xmin OR x <= xmax. In geographic 
terminology, the concepts of xmin, xmax, ymin, and ymax are also known as 
westernmost, easternmost, southernmost and northernmost, respectively. These 
points are further restricted to the canonical ranges of [-180..180] for X and 
[-90..90] for Y.
+In Iceberg v4, statistics are stored in typed fields grouped in a struct that 
corresponds to the table field. These stats structs are nested within the 
`content_stats` struct in manifest files.
 
-When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
+###### Field Statistics
+
+Field-level structs in `content_stats` are based on the corresponding table 
field's type, requirement, and ID (`field-id`).
+
+Field stats structs are assigned a range of 200 IDs, starting at `10_000 + 200 
* field-id`. The first ID in the range (`base-id`) is the ID of the struct 
field in `content_stats`. Fields within the stats struct are assigned IDs from 
the range by adding an offset to the `base-id`. For example, the stats struct 
for table field 2 uses IDs in the range `[10_400, 10_599]`, the field within 
`content_stats` uses the `base-id`, ID `10_400`, and its `lower_bound` field 
(offset 1) uses ID `10_401`.
+
+Content stats must be resolved by ID; field names used for stats structs are 
informational. The recommended name for each field is the full name of the 
field in the table schema.
+
+IDs in the range `10_000` (inclusive) to `200_000_000` (exclusive) are 
reserved for column stats structs in `content_stats`. Stats for table fields 
with stats IDs outside that range cannot be stored in `content_stats`.
+
+[Reserved metadata fields](#reserved-field-ids) must use the stats ID ranges 
from the following table. Stats for metadata fields not in the table are not 
tracked.
+
+| Reserved field                  | ID         | `base-id` | Range end |
+|---------------------------------|------------|-----------|-----------|
+| `_last_updated_sequence_number` | 2147483539 | 9000      | 9199 |
+| `_row_id`                       | 2147483540 | 9200      | 9399 |
+
+Each stats struct holds statistics for one table field. It may contain the 
following metrics:
+
+| Requirement | Offset | Name                      | Type                      
| Included for                                  | Description |
+|-------------|--------|---------------------------|---------------------------|-----------------------------------------------|-------------|
+| _optional_  | 1      | `lower_bound`             | Field type or `geo_lower` 
| all primitives or `variant`                   | Lower bound stored as the 
field's type, or `geo_lower` for geo types |
+| _optional_  | 2      | `upper_bound`             | Field type or `geo_upper` 
| all primitives or `variant`                   | Upper bound stored as the 
field's type, or `geo_upper` for geo types |
+| _optional_  | 3      | `tight_bounds`            | `boolean`                 
| all except `geometry`, `geography`, `variant` | When true, `lower_bound` and 
`upper_bound` must be equal to the min and max values |
+| _optional_  | 4      | `value_count`             | `long`                    
| all                                           | Number of values in the 
column (including null and NaN values) |
+| _optional_  | 5      | `null_value_count`        | `long`                    
| optional fields                               | Number of null values in the 
column |
+| _optional_  | 6      | `nan_value_count`         | `long`                    
| `float`, `double`                             | Number of NaN values in the 
column |
+| _optional_  | 7      | `avg_value_size_in_bytes` | `int`                     
| `string`, `binary`, `variant`                 | Avg value size (uncompressed) 
in bytes to estimate memory consumption |

Review Comment:
   +1 - two interop-relevant clarifications on this field:
   - Encoding: "uncompressed" alone leaves room to read this as "after Parquet 
dictionary decoding but before page compression". Given the description 
("estimate memory consumption"), the intended meaning is fully decoded value 
bytes. Suggest "uncompressed and unencoded".
   - Denominator: the spec does not say whether the average is over 
`value_count` or over non-null values. Consumers like Spark CBO expect the 
latter. Suggest the description say "average size in bytes per non-null value".
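
To make the denominator question concrete, here is a minimal sketch of the two readings. The helper `avg_value_size` and its input (a list of decoded per-value byte lengths, with `None` standing in for null) are hypothetical and exist only for illustration; they are not part of the spec or of any Iceberg API.

```python
from typing import Optional, Sequence, Tuple


def avg_value_size(
    sizes: Sequence[Optional[int]],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (average over value_count, average over non-null values).

    `sizes` holds the decoded byte length of each value; None marks a null.
    This is an illustrative helper, not an Iceberg API.
    """
    non_null = [s for s in sizes if s is not None]
    total_bytes = sum(non_null)
    value_count = len(sizes)  # includes nulls, as in the `value_count` metric

    per_value = total_bytes / value_count if value_count else None
    per_non_null = total_bytes / len(non_null) if non_null else None
    return per_value, per_non_null


# Two non-null values (10 and 30 bytes) and two nulls.
per_value, per_non_null = avg_value_size([10, None, 30, None])
print(per_value, per_non_null)  # 10.0 20.0
```

With half the column null, the two readings differ by a factor of two, so a reader and a writer that pick different denominators would disagree on size estimates for sparse columns.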



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

