blackmwk commented on code in PR #2188:
URL: https://github.com/apache/iceberg-rust/pull/2188#discussion_r3362089664


##########
crates/iceberg/src/spec/schema/mod.rs:
##########
@@ -53,6 +54,9 @@ pub type SchemaRef = Arc<Schema>;
 pub const DEFAULT_SCHEMA_ID: SchemaId = 0;
 /// Delimiter for schema name, which denotes a nested struct.
 pub const SCHEMA_NAME_DELIMITER: &str = ".";
+/// Minimum format version that allows non-null field default values.
+/// Mirrors Java's `Schema.DEFAULT_VALUES_MIN_FORMAT_VERSION`.
+pub const MIN_FORMAT_VERSION_DEFAULT_VALUES: FormatVersion = FormatVersion::V3;

Review Comment:
   ```suggestion
   pub(crate) const DEFAULT_VALUES_MIN_FORMAT_VERSION: FormatVersion = 
FormatVersion::V3;
   ```
   This is not a public api.



##########
crates/iceberg/src/spec/schema/mod.rs:
##########
@@ -421,6 +425,85 @@ impl Schema {
     pub fn field_id_to_fields(&self) -> &HashMap<i32, NestedFieldRef> {
         &self.id_to_field
     }
+
+    /// Minimum [`FormatVersion`] required to represent all *types* in this 
schema.
+    ///
+    /// Types only; for initial-default version floors see 
[`Schema::check_format_compatibility`].
+    pub fn min_format_version(&self) -> FormatVersion {
+        // `id_to_field` is flattened, so the max over all fields covers 
nested ones too.
+        self.id_to_field
+            .values()
+            .map(|f| leaf_min_format_version(&f.field_type))
+            .max()
+            .unwrap_or(FormatVersion::V1)
+    }
+
+    /// Returns an error listing every field incompatible with 
`format_version`.
+    /// Mirrors Java's `Schema.checkCompatibility()`. Two checks per field:
+    ///
+    /// - **Type** — per `leaf_min_format_version`.
+    /// - **Initial default** — a non-null `initial_default` backfills 
pre-existing rows,
+    ///   so it requires [`MIN_FORMAT_VERSION_DEFAULT_VALUES`]; 
`write_default` is not
+    ///   checked, as it only affects newly written rows (read identically at 
any version).
+    pub fn check_format_compatibility(&self, format_version: FormatVersion) -> 
Result<()> {
+        // (field id, message); sorted by id below for a deterministic error.
+        let mut problems: Vec<(i32, String)> = Vec::new();
+
+        // `id_to_field` is flattened, so checking each field by its own type 
keeps the
+        // blame on the offending leaf, not its container (mirrors Java's 
`lazyIdToField`).
+        for field in self.id_to_field.values() {
+            let min_version = leaf_min_format_version(&field.field_type);
+            if format_version < min_version {
+                let name = self
+                    .name_by_field_id(field.id)
+                    .unwrap_or(field.name.as_str());
+                problems.push((field.id, format!(
+                    "Invalid type for {name}: {} is not supported until 
{min_version} but format version is {format_version}.",
+                    field.field_type,
+                )));
+            }
+
+            if let Some(default) = &field.initial_default
+                && format_version < MIN_FORMAT_VERSION_DEFAULT_VALUES
+            {
+                let name = self
+                    .name_by_field_id(field.id)
+                    .unwrap_or(field.name.as_str());
+                problems.push((field.id, format!(
+                    "Invalid initial default for {name}: non-null default 
({default:?}) is not supported until {MIN_FORMAT_VERSION_DEFAULT_VALUES} but 
format version is {format_version}."
+                )));
+            }
+        }
+
+        if problems.is_empty() {
+            return Ok(());
+        }
+
+        // Stable sort by id: HashMap order is nondeterministic, and stability 
keeps a
+        // field's type problem before its default problem (matches Java's 
TreeMap order).
+        let message = problems
+            .into_iter()
+            .sorted_by_key(|(id, _)| *id)
+            .map(|(_, msg)| msg)
+            .join("\n- ");
+        Err(Error::new(
+            ErrorKind::DataInvalid,
+            format!("Invalid schema for {format_version}:\n- {message}"),
+        ))
+    }
+}
+
+/// Minimum [`FormatVersion`] required by a type itself, ignoring nested 
fields.
+///
+/// `TimestampNs` / `TimestamptzNs` / `Variant` require v3; everything else 
(including
+/// nested types, validated per-leaf elsewhere) is valid from v1. Single 
source of truth
+/// for the type version rules, mirroring Java's `Schema.MIN_FORMAT_VERSIONS`.
+fn leaf_min_format_version(field_type: &Type) -> FormatVersion {

Review Comment:
   This should be part of `Type`.



##########
crates/iceberg/src/spec/schema/mod.rs:
##########
@@ -421,6 +425,85 @@ impl Schema {
     pub fn field_id_to_fields(&self) -> &HashMap<i32, NestedFieldRef> {
         &self.id_to_field
     }
+
+    /// Minimum [`FormatVersion`] required to represent all *types* in this 
schema.
+    ///
+    /// Types only; for initial-default version floors see 
[`Schema::check_format_compatibility`].
+    pub fn min_format_version(&self) -> FormatVersion {
+        // `id_to_field` is flattened, so the max over all fields covers 
nested ones too.
+        self.id_to_field
+            .values()
+            .map(|f| leaf_min_format_version(&f.field_type))
+            .max()
+            .unwrap_or(FormatVersion::V1)
+    }
+
+    /// Returns an error listing every field incompatible with 
`format_version`.
+    /// Mirrors Java's `Schema.checkCompatibility()`. Two checks per field:
+    ///
+    /// - **Type** — per `leaf_min_format_version`.
+    /// - **Initial default** — a non-null `initial_default` backfills 
pre-existing rows,
+    ///   so it requires [`MIN_FORMAT_VERSION_DEFAULT_VALUES`]; 
`write_default` is not
+    ///   checked, as it only affects newly written rows (read identically at 
any version).
+    pub fn check_format_compatibility(&self, format_version: FormatVersion) -> 
Result<()> {
+        // (field id, message); sorted by id below for a deterministic error.
+        let mut problems: Vec<(i32, String)> = Vec::new();
+
+        // `id_to_field` is flattened, so checking each field by its own type 
keeps the
+        // blame on the offending leaf, not its container (mirrors Java's 
`lazyIdToField`).
+        for field in self.id_to_field.values() {
+            let min_version = leaf_min_format_version(&field.field_type);
+            if format_version < min_version {
+                let name = self
+                    .name_by_field_id(field.id)
+                    .unwrap_or(field.name.as_str());

Review Comment:
   This is a bug, we should return error.



##########
crates/iceberg/src/spec/schema/mod.rs:
##########
@@ -421,6 +425,85 @@ impl Schema {
     pub fn field_id_to_fields(&self) -> &HashMap<i32, NestedFieldRef> {
         &self.id_to_field
     }
+
+    /// Minimum [`FormatVersion`] required to represent all *types* in this 
schema.
+    ///
+    /// Types only; for initial-default version floors see 
[`Schema::check_format_compatibility`].
+    pub fn min_format_version(&self) -> FormatVersion {

Review Comment:
   ```suggestion
       pub fn calc_min_compatible_format(&self) -> FormatVersion {
   ```
   
   Also please change comments to clarify that this will visit whole schema to 
get this, or how about we store it a lazy field in `Schema`?



##########
crates/iceberg/src/spec/schema/mod.rs:
##########
@@ -421,6 +425,85 @@ impl Schema {
     pub fn field_id_to_fields(&self) -> &HashMap<i32, NestedFieldRef> {
         &self.id_to_field
     }
+
+    /// Minimum [`FormatVersion`] required to represent all *types* in this 
schema.
+    ///
+    /// Types only; for initial-default version floors see 
[`Schema::check_format_compatibility`].
+    pub fn min_format_version(&self) -> FormatVersion {
+        // `id_to_field` is flattened, so the max over all fields covers 
nested ones too.
+        self.id_to_field
+            .values()
+            .map(|f| leaf_min_format_version(&f.field_type))
+            .max()
+            .unwrap_or(FormatVersion::V1)
+    }
+
+    /// Returns an error listing every field incompatible with 
`format_version`.
+    /// Mirrors Java's `Schema.checkCompatibility()`. Two checks per field:
+    ///
+    /// - **Type** — per `leaf_min_format_version`.
+    /// - **Initial default** — a non-null `initial_default` backfills 
pre-existing rows,
+    ///   so it requires [`MIN_FORMAT_VERSION_DEFAULT_VALUES`]; 
`write_default` is not
+    ///   checked, as it only affects newly written rows (read identically at 
any version).
+    pub fn check_format_compatibility(&self, format_version: FormatVersion) -> 
Result<()> {
+        // (field id, message); sorted by id below for a deterministic error.
+        let mut problems: Vec<(i32, String)> = Vec::new();
+
+        // `id_to_field` is flattened, so checking each field by its own type 
keeps the
+        // blame on the offending leaf, not its container (mirrors Java's 
`lazyIdToField`).
+        for field in self.id_to_field.values() {
+            let min_version = leaf_min_format_version(&field.field_type);
+            if format_version < min_version {
+                let name = self
+                    .name_by_field_id(field.id)
+                    .unwrap_or(field.name.as_str());
+                problems.push((field.id, format!(
+                    "Invalid type for {name}: {} is not supported until 
{min_version} but format version is {format_version}.",
+                    field.field_type,
+                )));
+            }
+
+            if let Some(default) = &field.initial_default
+                && format_version < MIN_FORMAT_VERSION_DEFAULT_VALUES
+            {
+                let name = self
+                    .name_by_field_id(field.id)
+                    .unwrap_or(field.name.as_str());

Review Comment:
   Ditto.



##########
crates/iceberg/src/spec/schema/mod.rs:
##########
@@ -421,6 +425,85 @@ impl Schema {
     pub fn field_id_to_fields(&self) -> &HashMap<i32, NestedFieldRef> {
         &self.id_to_field
     }
+
+    /// Minimum [`FormatVersion`] required to represent all *types* in this 
schema.
+    ///
+    /// Types only; for initial-default version floors see 
[`Schema::check_format_compatibility`].
+    pub fn min_format_version(&self) -> FormatVersion {
+        // `id_to_field` is flattened, so the max over all fields covers 
nested ones too.
+        self.id_to_field
+            .values()
+            .map(|f| leaf_min_format_version(&f.field_type))
+            .max()
+            .unwrap_or(FormatVersion::V1)
+    }
+
+    /// Returns an error listing every field incompatible with 
`format_version`.
+    /// Mirrors Java's `Schema.checkCompatibility()`. Two checks per field:
+    ///
+    /// - **Type** — per `leaf_min_format_version`.

Review Comment:
   ```suggestion
       /// - **Type** — Minimum format version required to support that type, 
without taking nested filed types into account.
   ```
   
   The `leaf_min_format_version` is implementation detail, and we should not 
show it in comments.



##########
crates/iceberg/src/spec/schema/prune_columns.rs:
##########


Review Comment:
   I'm not talking about `prune_columns` only, I mean all other affects parts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to