Re: [PR] [SPARK-57412][SQL] Apply default collation to generated column expression references [spark]

via GitHub Fri, 12 Jun 2026 16:40:46 -0700


szehon-ho commented on code in PR #56465:
URL: https://github.com/apache/spark/pull/56465#discussion_r3406785362



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
    * collated types.
    */
   private def transformExpression: PartialFunction[Expression, String => 
Expression] = {
-    case columnDef: ColumnDefinition if 
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
-      collation => columnDef.copy(dataType =
-        replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+    case columnDef: ColumnDefinition
+        if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+          generationExpressionNeedsCollation(columnDef) =>
+      collation =>
+        // Re-point any default string/char/varchar typed references inside 
the generation
+        // expression to the default collation, matching the collation applied 
to the columns
+        // they reference. Other nodes (e.g. literals and casts) are handled 
by the dedicated
+        // cases below as part of the bottom-up traversal.
+        val newGenExpr = columnDef.generationExpression.map { genExpr =>
+          val newChild = genExpr.child.transform {
+            case a: AttributeReference if 
hasDefaultStringCharOrVarcharType(a.dataType) =>
+              a.copy(dataType =
+                replaceDefaultStringCharAndVarcharTypes(a.dataType, 
collation))(
+                a.exprId, a.qualifier)

Review Comment:
   Done, switched to `a.withDataType(...)`.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -371,6 +388,17 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
         subquery.withNewPlan(newPlan)
   }
 
+  /**
+   * Returns true if the column's generation expression references a default 
string/char/varchar
+   * type, so the column is transformed even when its own type is not a 
default string type
+   * (e.g. `c INT GENERATED ALWAYS AS (LENGTH(s))`).
+   */
+  private def generationExpressionNeedsCollation(columnDef: ColumnDefinition): 
Boolean = {
+    columnDef.generationExpression.exists { genExpr =>
+      genExpr.child.exists(e => hasDefaultStringCharOrVarcharType(e.dataType))

Review Comment:
   Good catch, fixed. The guard now matches only `AttributeReference`, so it 
never calls `dataType` on unresolved nodes:
   
   ```scala
   genExpr.child.exists {
     case a: AttributeReference => hasDefaultStringCharOrVarcharType(a.dataType)
     case _ => false
   }
   ```
   
   Also added a `CollationSuite` test that creates a generated column with a 
typo'd reference under `DEFAULT COLLATION` and asserts it reports 
`UNRESOLVED_COLUMN.WITH_SUGGESTION` (not `INTERNAL_ERROR`).



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
    * collated types.
    */
   private def transformExpression: PartialFunction[Expression, String => 
Expression] = {
-    case columnDef: ColumnDefinition if 
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
-      collation => columnDef.copy(dataType =
-        replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+    case columnDef: ColumnDefinition
+        if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+          generationExpressionNeedsCollation(columnDef) =>
+      collation =>
+        // Re-point any default string/char/varchar typed references inside 
the generation
+        // expression to the default collation, matching the collation applied 
to the columns
+        // they reference. Other nodes (e.g. literals and casts) are handled 
by the dedicated
+        // cases below as part of the bottom-up traversal.
+        val newGenExpr = columnDef.generationExpression.map { genExpr =>

Review Comment:
   I think this is benign and best left as a follow-up. A couple of reasons:
   
   - **The SQL standard allows collations in CHECK constraints.** ISO/IEC 
9075-2 (clause 11.9) only requires a check `<search condition>` to be 
*retrospectively deterministic* (depend solely on the row's data, no 
non-deterministic functions/subqueries). A collated comparison like `c = 'A'` 
under `UTF8_LCASE` qualifies. So, unlike generated columns, there is no 
"non-UTF8-binary collation not allowed" rule to enforce here — which means 
there is no validation that the stale `UTF8_BINARY` typing could wrongly pass. 
The generated-column restriction is a stored-value concern that does not apply 
to constraints.
   - **Enforcement re-resolves from `predicateSql`** against the live 
(collated) schema at write time, so the runtime comparison uses the correct 
collation regardless of the persisted typed tree. The only artifact is a 
cosmetically stale type in the serialized predicate, with no behavioral effect.
   
   One nuance for completeness: collation comparisons are deterministic for a 
fixed collation definition; an ICU collation-version upgrade could in principle 
flip a comparison's result, but that is a general property of collated columns 
everywhere, not specific to constraints, and not introduced by this PR.
   
   Happy to file a follow-up to re-point constraint references for consistency 
(ideally via a shared helper), but it isn't needed for correctness here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57412][SQL] Apply default collation to generated column expression references [spark]

Reply via email to