gengliangwang commented on code in PR #56619:
URL: https://github.com/apache/spark/pull/56619#discussion_r3454782837
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##########
@@ -445,6 +448,11 @@ private[spark] object SchemaUtils {
fieldsByName.foreach { case (normalizedName, field) =>
otherFieldsByName.get(normalizedName) match {
case Some(otherField) =>
+ if (checkFieldIds) {
+ for (id <- field.id; otherId <- otherField.id if id !=
otherId) {
+ errors += s"${colPath.fullyQuoted} field ID has changed from
$id to $otherId"
Review Comment:
This reports the parent struct's path, not the field whose ID changed. The
check runs one level above the recursion that appends `field.name`, so unlike
the sibling errors (`formatField` for removed/added, and the type/nullability
messages emitted inside the recursion) it omits the field name: a top-level
column change becomes `" field ID has changed from 1 to 99"` (empty path), and
a nested change names the container — two changed sibling fields then produce
indistinguishable messages.
```suggestion
errors += s"${(colPath :+ field.name).fullyQuoted} field
ID has changed from $id to $otherId"
```
The six `V2TableUtilSuite` expectations (lines 569/587/602/621-622/639/656)
assert the current output and will need updating; the explanatory comment at
line 586 also carries a non-ASCII em-dash — switch it to ASCII while you're
there.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala:
##########
@@ -653,6 +672,9 @@ private[sql] object CatalogV2Util {
Option(col.defaultValue()).foreach { default =>
f = encodeDefaultValue(default, f)
}
+ Option(col.id()).foreach { id =>
+ f = f.withId(id)
Review Comment:
Encoding the top-level ID here also rides into the user-facing schema:
`DataSourceV2Relation.create` builds its output from `table.columns.asSchema`
(which calls this) via `toAttributes`, and `FIELD_ID` isn't in
`INTERNAL_METADATA_KEYS`, so it looks like it ends up in `df.schema` field
metadata for every column / nested field of an ID-tracking connector. Is that
intended? If not, adding `FIELD_ID` to `INTERNAL_METADATA_KEYS` (or stripping
it at the relation boundary) would keep it internal. I traced the encode path
but didn't confirm end-user visibility — could you check?
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SchemaUtils.scala:
##########
@@ -522,6 +534,39 @@ private[spark] object SchemaUtils {
fields.map(field => field.name.toLowerCase(Locale.ROOT) -> field).toMap
}
}
+
+ /**
+ * Recursively removes field IDs from a data type. Only allocates new
objects when a field ID
+ * is actually present, returning the original instance unchanged when there
is nothing to strip.
+ */
+ def removeFieldIds(dataType: DataType): DataType = dataType match {
+ case s: StructType =>
+ StructType(s.fields.map { field =>
Review Comment:
This branch always rebuilds via `StructType(s.fields.map(...))`, so it never
returns the original `s` — but the Scaladoc promises "the original instance
unchanged when there is nothing to strip," which only the `ArrayType`/`MapType`
branches honor. `InMemoryBaseTable.assignFieldIds` (line 945) shows the struct
short-circuit: track whether any field changed and return `s` if none did.
Either apply that here or relax the comment.
##########
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala:
##########
@@ -916,32 +907,60 @@ abstract class InMemoryBaseTable(
object InMemoryBaseTable {
private val columnIdGlobalCounter = new AtomicLong(0)
def nextColumnId(): Long = columnIdGlobalCounter.incrementAndGet()
-
- private def normalize(name: String): String = name.toLowerCase(Locale.ROOT)
+ def nextColumnIdString(): String = nextColumnId().toString
/**
- * Preserves column IDs from `oldColumns` when the column name matches,
- * and assigns new IDs to columns that do not already have one.
+ * Assigns fresh IDs to any top-level column or nested struct field that
does not already
+ * have one. Recurses into struct fields within ArrayType and MapType so
that every field
+ * at every depth gets an ID.
*
- * IDs are preserved across type changes, keeping the same column ID through
type
- * widening and nested field additions.
[[TypeChangeResetsColIdTableCatalog]] overrides
- * this behavior for testing scenarios where type changes should produce a
new ID.
+ * Existing IDs are preserved: Column → StructType → Column round-trip
encodes them in
Review Comment:
Non-ASCII arrows (U+2192); use ASCII (CLAUDE.md / scalastyle):
```suggestion
* Existing IDs are preserved: Column -> StructType -> Column round-trip
encodes them in
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]