szehon-ho commented on code in PR #56465:
URL: https://github.com/apache/spark/pull/56465#discussion_r3406785362
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
* collated types.
*/
private def transformExpression: PartialFunction[Expression, String =>
Expression] = {
- case columnDef: ColumnDefinition if
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
- collation => columnDef.copy(dataType =
- replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+ case columnDef: ColumnDefinition
+ if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+ generationExpressionNeedsCollation(columnDef) =>
+ collation =>
+ // Re-point any default string/char/varchar typed references inside
the generation
+ // expression to the default collation, matching the collation applied
to the columns
+ // they reference. Other nodes (e.g. literals and casts) are handled
by the dedicated
+ // cases below as part of the bottom-up traversal.
+ val newGenExpr = columnDef.generationExpression.map { genExpr =>
+ val newChild = genExpr.child.transform {
+ case a: AttributeReference if
hasDefaultStringCharOrVarcharType(a.dataType) =>
+ a.copy(dataType =
+ replaceDefaultStringCharAndVarcharTypes(a.dataType,
collation))(
+ a.exprId, a.qualifier)
Review Comment:
Done, switched to `a.withDataType(...)`.
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -371,6 +388,17 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
subquery.withNewPlan(newPlan)
}
+ /**
+ * Returns true if the column's generation expression references a default
string/char/varchar
+ * type, so the column is transformed even when its own type is not a
default string type
+ * (e.g. `c INT GENERATED ALWAYS AS (LENGTH(s))`).
+ */
+ private def generationExpressionNeedsCollation(columnDef: ColumnDefinition):
Boolean = {
+ columnDef.generationExpression.exists { genExpr =>
+ genExpr.child.exists(e => hasDefaultStringCharOrVarcharType(e.dataType))
Review Comment:
Good catch, fixed. The guard now matches only `AttributeReference`, so it
never calls `dataType` on unresolved nodes:
```scala
genExpr.child.exists {
case a: AttributeReference => hasDefaultStringCharOrVarcharType(a.dataType)
case _ => false
}
```
Also added a `CollationSuite` test that creates a generated column with a
typo'd reference under `DEFAULT COLLATION` and asserts it reports
`UNRESOLVED_COLUMN.WITH_SUGGESTION` (not `INTERNAL_ERROR`).
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
* collated types.
*/
private def transformExpression: PartialFunction[Expression, String =>
Expression] = {
- case columnDef: ColumnDefinition if
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
- collation => columnDef.copy(dataType =
- replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+ case columnDef: ColumnDefinition
+ if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+ generationExpressionNeedsCollation(columnDef) =>
+ collation =>
+ // Re-point any default string/char/varchar typed references inside
the generation
+ // expression to the default collation, matching the collation applied
to the columns
+ // they reference. Other nodes (e.g. literals and casts) are handled
by the dedicated
+ // cases below as part of the bottom-up traversal.
+ val newGenExpr = columnDef.generationExpression.map { genExpr =>
Review Comment:
I think this is benign and best left as a follow-up. A couple of reasons:
- **The SQL standard allows collations in CHECK constraints.** ISO/IEC
9075-2 (clause 11.9) only requires a check `<search condition>` to be
*retrospectively deterministic* (depend solely on the row's data, no
non-deterministic functions/subqueries). A collated comparison like `c = 'A'`
under `UTF8_LCASE` qualifies. So, unlike generated columns, there is no
"non-UTF8-binary collation not allowed" rule to enforce here — which means
there is no validation that the stale `UTF8_BINARY` typing could wrongly pass.
The generated-column restriction is a stored-value concern that does not apply
to constraints.
- **Enforcement re-resolves from `predicateSql`** against the live
(collated) schema at write time, so the runtime comparison uses the correct
collation regardless of the persisted typed tree. The only artifact is a
cosmetically stale type in the serialized predicate, with no behavioral effect.
One nuance for completeness: collation comparisons are deterministic for a
fixed collation definition; an ICU collation-version upgrade could in principle
flip a comparison's result, but that is a general property of collated columns
everywhere, not specific to constraints, and not introduced by this PR.
Happy to file a follow-up to re-point constraint references for consistency
(ideally via a shared helper), but it isn't needed for correctness here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]