Re: [PR] [SPARK-57544][SQL] Rework column ID validation for nested fields in DSv2 [spark]

via GitHub Mon, 22 Jun 2026 12:04:04 -0700


gengliangwang commented on code in PR #56619:
URL: https://github.com/apache/spark/pull/56619#discussion_r3454782837



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##########
@@ -445,6 +448,11 @@ private[spark] object SchemaUtils {
         fieldsByName.foreach { case (normalizedName, field) =>
           otherFieldsByName.get(normalizedName) match {
             case Some(otherField) =>
+              if (checkFieldIds) {
+                for (id <- field.id; otherId <- otherField.id if id != 
otherId) {
+                  errors += s"${colPath.fullyQuoted} field ID has changed from 
$id to $otherId"

Review Comment:
   This reports the parent struct's path, not the field whose ID changed. The 
check runs one level above the recursion that appends `field.name`, so unlike 
the sibling errors (`formatField` for removed/added, and the type/nullability 
messages emitted inside the recursion) it omits the field name: a top-level 
column change becomes `" field ID has changed from 1 to 99"` (empty path), and 
a nested change names the container — two changed sibling fields then produce 
indistinguishable messages.
   
   ```suggestion
                     errors += s"${(colPath :+ field.name).fullyQuoted} field 
ID has changed from $id to $otherId"
   ```
   
   The six `V2TableUtilSuite` expectations (lines 569/587/602/621-622/639/656) 
assert the current output and will need updating; the explanatory comment at 
line 586 also carries a non-ASCII em-dash — switch it to ASCII while you're 
there.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
##########
@@ -653,6 +672,9 @@ private[sql] object CatalogV2Util {
     Option(col.defaultValue()).foreach { default =>
       f = encodeDefaultValue(default, f)
     }
+    Option(col.id()).foreach { id =>
+      f = f.withId(id)

Review Comment:
   Encoding the top-level ID here also rides into the user-facing schema: 
`DataSourceV2Relation.create` builds its output from `table.columns.asSchema` 
(which calls this) via `toAttributes`, and `FIELD_ID` isn't in 
`INTERNAL_METADATA_KEYS`, so it looks like it ends up in `df.schema` field 
metadata for every column / nested field of an ID-tracking connector. Is that 
intended? If not, adding `FIELD_ID` to `INTERNAL_METADATA_KEYS` (or stripping 
it at the relation boundary) would keep it internal. I traced the encode path 
but didn't confirm end-user visibility — could you check?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##########
@@ -522,6 +534,39 @@ private[spark] object SchemaUtils {
       fields.map(field => field.name.toLowerCase(Locale.ROOT) -> field).toMap
     }
   }
+
+  /**
+   * Recursively removes field IDs from a data type. Only allocates new 
objects when a field ID
+   * is actually present, returning the original instance unchanged when there 
is nothing to strip.
+   */
+  def removeFieldIds(dataType: DataType): DataType = dataType match {
+    case s: StructType =>
+      StructType(s.fields.map { field =>

Review Comment:
   This branch always rebuilds via `StructType(s.fields.map(...))`, so it never 
returns the original `s` — but the Scaladoc promises "the original instance 
unchanged when there is nothing to strip," which only the `ArrayType`/`MapType` 
branches honor. `InMemoryBaseTable.assignFieldIds` (line 945) shows the struct 
short-circuit: track whether any field changed and return `s` if none did. 
Either apply that here or relax the comment.



##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -916,32 +907,60 @@ abstract class InMemoryBaseTable(
 object InMemoryBaseTable {
   private val columnIdGlobalCounter = new AtomicLong(0)
   def nextColumnId(): Long = columnIdGlobalCounter.incrementAndGet()
-
-  private def normalize(name: String): String = name.toLowerCase(Locale.ROOT)
+  def nextColumnIdString(): String = nextColumnId().toString
 
   /**
-   * Preserves column IDs from `oldColumns` when the column name matches,
-   * and assigns new IDs to columns that do not already have one.
+   * Assigns fresh IDs to any top-level column or nested struct field that 
does not already
+   * have one. Recurses into struct fields within ArrayType and MapType so 
that every field
+   * at every depth gets an ID.
    *
-   * IDs are preserved across type changes, keeping the same column ID through 
type
-   * widening and nested field additions. 
[[TypeChangeResetsColIdTableCatalog]] overrides
-   * this behavior for testing scenarios where type changes should produce a 
new ID.
+   * Existing IDs are preserved: Column → StructType → Column round-trip 
encodes them in

Review Comment:
   Non-ASCII arrows (U+2192); use ASCII (CLAUDE.md / scalastyle):
   ```suggestion
      * Existing IDs are preserved: Column -> StructType -> Column round-trip 
encodes them in
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57544][SQL] Rework column ID validation for nested fields in DSv2 [spark]

Reply via email to