nastra commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r2715536169
##########
format/spec.md:
##########
@@ -707,6 +707,91 @@ For `geography` only, xmin (X value of `lower_bounds`) may
be greater than xmax
When calculating upper and lower bounds for `geometry` and `geography`, null
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN)
contributes a value to X but no values to Y, Z, or M dimension bounds. If a
dimension has only null or NaN values, that dimension is omitted from the
bounding box. If either the X or Y dimension is missing then the bounding box
itself is not produced.
+#### Content Stats
+
+Content stats have been introduced with v4 and hold stats in a
`struct<struct<...>>` where each nested struct holds the stats for an
individual field of a table. The different field stats types are defined in the
next section.
+
+##### Field Stats Types
+
+The struct that holds individual stats for a particular field of a table
consists of the following fields:
+
+| Name | Type | Offset from field ID of base struct | required | Description |
+|------------------|---------------------|-------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| value_count | `long` | 1 | false | Number of values in the column (including null and NaN values) |
+| null_value_count | `long` | 2 | false | Number of null values in the column |
+| nan_value_count | `long` | 3 | false | Number of NaN values in the column |
+| avg_value_count | `int` | 4 | false | The avg value count for variable-length types (string/binary) |
+| max_value_count | `long` | 5 | false | The max value count for variable-length types (string/binary) |
+| lower_bound | type of table field | 6 | false | Lower bound in the column serialized as the type of the column itself. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] |
+| upper_bound | type of table field | 7 | false | Upper bound in the column serialized as the type of the column itself. Each value must be greater than or equal to all non-null, non-NaN values in the column for the file [2] |
+| exact_bounds | `boolean` | 8 | false | Whether the `upper_bound` / `lower_bound` is exact or not |
Review Comment:
```suggestion
| exact_bounds | `boolean` | 8 | false | Whether the `upper_bound` / `lower_bound` is exact or not. Types such as string/binary can't have exact bounds. Additionally, if a DV or an equality delete matches a given data file, then `exact_bounds` must be treated as `false` |
```
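
To illustrate the intent of the suggested wording, here is a minimal sketch of how a reader might decide whether bounds can be trusted as exact. It is not taken from the Iceberg codebase: `FieldStats`, `INEXACT_ONLY_TYPES`, `effective_exact_bounds`, and the two boolean flags are hypothetical names, the type set contains only the two types named in the comment, and treating a missing `exact_bounds` as `false` is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumption: only the types named in the review comment; the spec may list more.
INEXACT_ONLY_TYPES = frozenset({"string", "binary"})


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical in-memory view of the bounds-related field stats."""
    field_type: str
    lower_bound: Optional[Any] = None
    upper_bound: Optional[Any] = None
    exact_bounds: Optional[bool] = None  # optional in the table (required = false)


def effective_exact_bounds(
    stats: FieldStats,
    has_matching_dv: bool,
    has_matching_equality_delete: bool,
) -> bool:
    """Return True only if the stored bounds may be treated as exact."""
    # A DV or equality delete may have removed the rows holding the min/max,
    # so the stored flag must be treated as false for this data file.
    if has_matching_dv or has_matching_equality_delete:
        return False
    # Types such as string/binary can't have exact bounds.
    if stats.field_type in INEXACT_ONLY_TYPES:
        return False
    # Assumption: a missing flag is handled conservatively as "not exact".
    return stats.exact_bounds is True
```

With this shape, a reader would call `effective_exact_bounds` once per data file and field before relying on `lower_bound` / `upper_bound` as actual column values rather than as conservative limits.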
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]