Re: [PR] Spec: Add content stats to spec [iceberg]

via GitHub Fri, 08 May 2026 02:20:01 -0700


nastra commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3207678943



##########
format/spec.md:
##########
@@ -707,6 +714,131 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+In Iceberg v4 stats have been redesigned and are represented by using nested 
structs (`struct<struct<...>>`). The statistics for fields are tracked inside a 
nested struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic mapping from the **table ID space** to 
the **stats ID space**, where a given field ID from the **table ID space** gets 
an ID assigned from the **stats ID space** for each field-level statistics 
struct.
+Each field-level statistic listed in the [field stats types 
section](#field-stats-types) has a fixed offset. Its stats field ID is the 
enclosing stats struct's ID plus that offset.
+
+**Data columns (normal table field ids)**
+Mapping a table field ID from the **table ID space** to the **stats ID space** 
is done via:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. `200` 
represents the number of supported stats per column 
(`num_supported_stats_per_column = 200`).
+
+The formula is defined as:
+`stats_struct_id = stats_space_field_id_start_for_data_fields + 
(num_supported_stats_per_column * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+**Metadata columns (reserved table field ids)**
+
+[Reserved metadata fields](#reserved-field-ids) use a different starting base 
for their stats field ids in order to not overlap with data field stats ids. 
Mapping a reserved table field ID to the **stats ID space** is done via:
+
+`stats_struct_id = 2_147_000_000 + (200 * (200 - (Integer.MAX_VALUE - 
table_field_id)))`
+
+Here `2_147_000_000` is `stats_space_field_id_start_for_metadata_fields`. This 
separate base is required because reserved ids are near `Integer.MAX_VALUE` and 
cannot use the same linear mapping as data field ids.
+The first `200` refers to `num_supported_stats_per_column = 200` and the 
second `200` refers to `num_reserved_field_ids = 200` from [Reserved field 
ids](#reserved-field-ids).
+
+The formula is defined as:
+`stats_struct_id = stats_space_field_id_start_for_metadata_fields + 
(num_supported_stats_per_column * (num_reserved_field_ids - (Integer.MAX_VALUE 
- table_field_id)))`
+
+Valid data field ids support stats structs with ids from `10_000` through 
`200_010_000`, so the highest supported **data** field id is `1_000_000`.
+
+###### Name assignment for `content_stats` fields
+
+Each nested stats struct is a **child field** of the root `content_stats` 
struct. Its **name** is the numerical string of the table column's field id 
(for example id `103` uses the name `"103"`).
+Its **field id** is deterministically calculated as defined in the previous 
section. The name is informational and readers must resolve content stats by ID.
+
+###### Field stats types
+
+Each stats struct holds statistics for one table column. It may contain the 
following metrics:
+
+| required/optional | Offset | Name        | Type   | included for | Description                                                    |
+|-------------------|--------|-------------|--------|--------------|----------------------------------------------------------------|
+| _optional_        | 1      | value_count | `long` | all types    | Number of values in the column (including null and NaN values) |

Review Comment:
   I think we might want to leave out avg/max value sizes in this case then, 
because we wouldn't be using those right away, unless I'm missing where else 
we'd be immediately using the avg value size?
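
For readers following the ID arithmetic in the quoted hunk, the two mappings can be sketched as below. This is a rough illustration and not code from the PR: the function names `stats_struct_id` and `stats_field_id` are hypothetical, the constants are copied from the hunk, and two points are assumptions on my side: that reserved ids are those greater than `Integer.MAX_VALUE - 200`, and that a valid offset lies strictly between 0 and 200.

```python
# Sketch of the stats ID mapping described in the quoted spec hunk.
# Names of the functions are hypothetical; constants come from the hunk.
INT_MAX = 2**31 - 1                     # Java's Integer.MAX_VALUE
DATA_STATS_START = 10_000               # stats_space_field_id_start_for_data_fields
METADATA_STATS_START = 2_147_000_000    # stats_space_field_id_start_for_metadata_fields
NUM_STATS_PER_COLUMN = 200              # num_supported_stats_per_column
NUM_RESERVED_FIELD_IDS = 200            # num_reserved_field_ids
MAX_DATA_FIELD_ID = 1_000_000           # highest supported data field id per the hunk


def stats_struct_id(table_field_id: int) -> int:
    """Map a table field id to the id of its field-level stats struct."""
    # Assumption: ids above INT_MAX - 200 are the reserved metadata ids.
    if table_field_id > INT_MAX - NUM_RESERVED_FIELD_IDS:
        if table_field_id > INT_MAX:
            raise ValueError(f"field id {table_field_id} exceeds Integer.MAX_VALUE")
        slot = NUM_RESERVED_FIELD_IDS - (INT_MAX - table_field_id)
        return METADATA_STATS_START + NUM_STATS_PER_COLUMN * slot
    if not 0 <= table_field_id <= MAX_DATA_FIELD_ID:
        raise ValueError(f"field id {table_field_id} is outside the supported range")
    return DATA_STATS_START + NUM_STATS_PER_COLUMN * table_field_id


def stats_field_id(table_field_id: int, offset: int) -> int:
    """Id of one statistic (e.g. offset 1 = value_count) inside the stats struct."""
    # Assumption: offsets stay inside the 200-id block reserved per column.
    if not 0 < offset < NUM_STATS_PER_COLUMN:
        raise ValueError(f"offset {offset} does not fit in a block of 200 ids")
    return stats_struct_id(table_field_id) + offset


if __name__ == "__main__":
    print(stats_struct_id(103))      # stats struct id for table field 103
    print(stats_field_id(103, 1))    # value_count id for table field 103
    print(stats_struct_id(INT_MAX))  # stats struct id for the highest reserved id
```

Under these assumptions, table field `103` gets the stats struct id `30_600` and its `value_count` lands at `30_601`, while the highest reserved id maps to `2_147_040_000`, which still fits in a 32-bit signed integer.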



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

