Re: [PR] Spec: Add content stats to spec [iceberg]

via GitHub Wed, 06 May 2026 14:57:42 -0700


rdblue commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3197745674



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. The 
value **200** is both the width of each column's stats block and 
`num_reserved_field_ids` from [Reserved field ids](#reserved-field-ids).
+
+**Reserved table field ids.**
+
+Columns whose ids fall in the [reserved field ID](#reserved-field-ids) space 
use a different base so their stats ids do not overlap data columns:
+
+`stats_struct_id = 2_147_000_000 + (200 * (200 - (Integer.MAX_VALUE - 
table_field_id)))`

Review Comment:
   I completely agree with @stevenzwu here. I was confused by the 
implementation in `StatsUtil` and eventually figured out that it is just 
getting a value between 0 and 200 for the reserved ID. We can do that much more 
easily using the starting ID value that is already explicit in the spec: 
`2_147_483_447`.
   
   I also think this is unnecessary because I would avoid this complexity for 
assigning IDs. See my other comments about only needing stats for a couple of 
metadata fields, like `_row_id`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Spec: Add content stats to spec [iceberg]

Reply via email to