cloud-fan commented on code in PR #56465:
URL: https://github.com/apache/spark/pull/56465#discussion_r3406664556
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -371,6 +388,17 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
subquery.withNewPlan(newPlan)
}
+ /**
+ * Returns true if the column's generation expression references a default
string/char/varchar
+ * type, so the column is transformed even when its own type is not a
default string type
+ * (e.g. `c INT GENERATED ALWAYS AS (LENGTH(s))`).
+ */
+ private def generationExpressionNeedsCollation(columnDef: ColumnDefinition):
Boolean = {
+ columnDef.generationExpression.exists { genExpr =>
+ genExpr.child.exists(e => hasDefaultStringCharOrVarcharType(e.dataType))
Review Comment:
This can throw `UnresolvedException` (INTERNAL_ERROR) before `CheckAnalysis`
gets to report the real error: `resolveExpressionsUp` evaluates the
`transformExpression` guard on unresolved expressions too, and
`UnresolvedAttribute.dataType` / `UnresolvedFunction.dataType` throw. Reachable
with a typo'd column reference, e.g. `CREATE TABLE t (c1 STRING, c2 INT
GENERATED ALWAYS AS (LENGTH(c_typo))) USING ... DEFAULT COLLATION UNICODE` —
`c_typo` stays unresolved until `CheckAnalysis`, but this guard runs first and
turns the friendly unresolved-column error into an internal error. (For a
`STRING` generated column the same happens one fixed-point iteration later,
once the column's own type is collated and the first disjunct stops
short-circuiting.)
Matching only `AttributeReference` skips unresolved nodes the same way the
transform does, and also keeps the guard from staying true forever when the
only default-string node is a non-attribute (e.g. a string-producing function),
where the transform is identity anyway:
```suggestion
genExpr.child.exists {
case a: AttributeReference =>
hasDefaultStringCharOrVarcharType(a.dataType)
case _ => false
}
```
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
* collated types.
*/
private def transformExpression: PartialFunction[Expression, String =>
Expression] = {
- case columnDef: ColumnDefinition if
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
- collation => columnDef.copy(dataType =
- replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+ case columnDef: ColumnDefinition
+ if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+ generationExpressionNeedsCollation(columnDef) =>
+ collation =>
+ // Re-point any default string/char/varchar typed references inside
the generation
+ // expression to the default collation, matching the collation applied
to the columns
+ // they reference. Other nodes (e.g. literals and casts) are handled
by the dedicated
+ // cases below as part of the bottom-up traversal.
+ val newGenExpr = columnDef.generationExpression.map { genExpr =>
Review Comment:
Question, possibly for a follow-up: check constraints are resolved against
the same pre-collation `columnOutput` (the comment in
`ResolveCatalogs.resolveIdentifier` names constraints as the reason it exists),
so under a table `DEFAULT COLLATION` their `AttributeReference`s also keep
`UTF8_BINARY`, and nothing re-points them. `toV2Constraint` builds the
persisted V2 predicate from that stale-typed tree, though enforcement
re-resolves from `predicateSql` at write time. Does the persisted predicate's
typing matter for constraints, or is this benign?
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ApplyDefaultCollation.scala:
##########
@@ -348,9 +348,26 @@ object ApplyDefaultCollation extends Rule[LogicalPlan] {
* collated types.
*/
private def transformExpression: PartialFunction[Expression, String =>
Expression] = {
- case columnDef: ColumnDefinition if
hasDefaultStringCharOrVarcharType(columnDef.dataType) =>
- collation => columnDef.copy(dataType =
- replaceDefaultStringCharAndVarcharTypes(columnDef.dataType, collation))
+ case columnDef: ColumnDefinition
+ if hasDefaultStringCharOrVarcharType(columnDef.dataType) ||
+ generationExpressionNeedsCollation(columnDef) =>
+ collation =>
+ // Re-point any default string/char/varchar typed references inside
the generation
+ // expression to the default collation, matching the collation applied
to the columns
+ // they reference. Other nodes (e.g. literals and casts) are handled
by the dedicated
+ // cases below as part of the bottom-up traversal.
+ val newGenExpr = columnDef.generationExpression.map { genExpr =>
+ val newChild = genExpr.child.transform {
+ case a: AttributeReference if
hasDefaultStringCharOrVarcharType(a.dataType) =>
+ a.copy(dataType =
+ replaceDefaultStringCharAndVarcharTypes(a.dataType,
collation))(
+ a.exprId, a.qualifier)
Review Comment:
`withDataType` is the existing idiom for exactly this:
```suggestion
a.withDataType(replaceDefaultStringCharAndVarcharTypes(a.dataType, collation))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]