Re: [PR] [SPARK-57412][SQL] Apply default collation to generated column expression references [spark]

via GitHub Fri, 12 Jun 2026 16:02:15 -0700


cloud-fan commented on code in PR #56465:
URL: https://github.com/apache/spark/pull/56465#discussion_r3406664556



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -371,6 +388,17 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
         subquery.withNewPlan(newPlan)
   }
 
+  /**
+   * Returns true if the column's generation expression references a default 
string/char/varchar
+   * type, so the column is transformed even when its own type is not a 
default string type
+   * (e.g. `c INT GENERATED ALWAYS AS (LENGTH(s))`).
+   */
+  private def generationExpressionNeedsCollation(columnDef: ColumnDefinition): 
Boolean = {
+    columnDef.generationExpression.exists { genExpr =>
+      genExpr.child.exists(e => hasDefaultStringCharOrVarcharType(e.dataType))

Review Comment:
   This can throw `UnresolvedException` (INTERNAL_ERROR) before `CheckAnalysis` 
gets to report the real error: `resolveExpressionsUp` evaluates the 
`transformExpression` guard on unresolved expressions too, and 
`UnresolvedAttribute.dataType` / `UnresolvedFunction.dataType` throw. Reachable 
with a typo'd column reference, e.g. `CREATE TABLE t (c1 STRING, c2 INT 
GENERATED ALWAYS AS (LENGTH(c_typo))) USING ... DEFAULT COLLATION UNICODE` — 
`c_typo` stays unresolved until `CheckAnalysis`, but this guard runs first and 
turns the friendly unresolved-column error into an internal error. (For a 
`STRING` generated column the same happens one fixed-point iteration later, 
once the column's own type is collated and the first disjunct stops 
short-circuiting.)
   
   Matching only `AttributeReference` skips unresolved nodes the same way the 
transform does, and also keeps the guard from staying true forever when the 
only default-string node is a non-attribute (e.g. a string-producing function), 
where the transform is identity anyway:
   
   ```suggestion
         genExpr.child.exists {
           case a: AttributeReference => 
hasDefaultStringCharOrVarcharType(a.dataType)
           case _ => false
         }
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
    * collated types.
    */
   private def transformExpression: PartialFunction[Expression, String => 
Expression] = {
-    case columnDef: ColumnDefinition if 
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
-      collation => columnDef.copy(dataType =
-        replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+    case columnDef: ColumnDefinition
+        if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+          generationExpressionNeedsCollation(columnDef) =>
+      collation =>
+        // Re-point any default string/char/varchar typed references inside 
the generation
+        // expression to the default collation, matching the collation applied 
to the columns
+        // they reference. Other nodes (e.g. literals and casts) are handled 
by the dedicated
+        // cases below as part of the bottom-up traversal.
+        val newGenExpr = columnDef.generationExpression.map { genExpr =>

Review Comment:
   Question, possibly for a follow-up: check constraints are resolved against 
the same pre-collation `columnOutput` (the comment in 
`ResolveCatalogs.resolveIdentifier` names constraints as the reason it exists), 
so under a table `DEFAULT COLLATION` their `AttributeReference`s also keep 
`UTF8_BINARY`, and nothing re-points them. `toV2Constraint` builds the 
persisted V2 predicate from that stale-typed tree, though enforcement 
re-resolves from `predicateSql` at write time. Does the persisted predicate's 
typing matter for constraints, or is this benign?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
    * collated types.
    */
   private def transformExpression: PartialFunction[Expression, String => 
Expression] = {
-    case columnDef: ColumnDefinition if 
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
-      collation => columnDef.copy(dataType =
-        replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+    case columnDef: ColumnDefinition
+        if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+          generationExpressionNeedsCollation(columnDef) =>
+      collation =>
+        // Re-point any default string/char/varchar typed references inside 
the generation
+        // expression to the default collation, matching the collation applied 
to the columns
+        // they reference. Other nodes (e.g. literals and casts) are handled 
by the dedicated
+        // cases below as part of the bottom-up traversal.
+        val newGenExpr = columnDef.generationExpression.map { genExpr =>
+          val newChild = genExpr.child.transform {
+            case a: AttributeReference if 
hasDefaultStringCharOrVarcharType(a.dataType) =>
+              a.copy(dataType =
+                replaceDefaultStringCharAndVarcharTypes(a.dataType, 
collation))(
+                a.exprId, a.qualifier)

Review Comment:
   `withDataType` is the existing idiom for exactly this:
   ```suggestion
                 
a.withDataType(replaceDefaultStringCharAndVarcharTypes(a.dataType, collation))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57412][SQL] Apply default collation to generated column expression references [spark]

Reply via email to