szehon-ho opened a new pull request, #56465:
URL: https://github.com/apache/spark/pull/56465
### What changes were proposed in this pull request?
`ApplyDefaultCollation` now re-points `AttributeReference`s of default
string/char/varchar type inside a column's generation expression to the
table's default collation, in the same place it assigns the default collation
to the `ColumnDefinition` itself. The `ColumnDefinition` case guard is
broadened so this also runs when the generated column's own output type is not
a default string type (e.g. `c INT GENERATED ALWAYS AS (LENGTH(s))`).
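The re-pointing described above can be sketched with a toy model. This is not Spark code: `StringType`, `IntType`, `Attr`, `Call`, `ColumnDef`, `repoint`, and `apply_default_collation` are hypothetical stand-ins, and the sketch assumes a "default" string type is one with no explicit collation (modeled here as `collation=None`).

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class StringType:
    # Assumption: None models a "default" string type (no explicit collation).
    collation: Optional[str] = None


@dataclass(frozen=True)
class IntType:
    pass


@dataclass(frozen=True)
class Attr:
    """Stand-in for a resolved column reference inside an expression."""
    name: str
    data_type: object


@dataclass(frozen=True)
class Call:
    """Stand-in for a function call node such as LENGTH(s)."""
    fn: str
    args: Tuple[object, ...]


@dataclass(frozen=True)
class ColumnDef:
    name: str
    data_type: object
    generation_expr: Optional[object] = None


def is_default_string(data_type) -> bool:
    return isinstance(data_type, StringType) and data_type.collation is None


def repoint(expr, collation: str):
    """Rewrite default-string references in the tree to the table collation."""
    if isinstance(expr, Attr) and is_default_string(expr.data_type):
        return replace(expr, data_type=StringType(collation))
    if isinstance(expr, Call):
        return replace(expr, args=tuple(repoint(a, collation) for a in expr.args))
    return expr  # literals and explicitly collated references are left alone


def apply_default_collation(col: ColumnDef, collation: str) -> ColumnDef:
    # The column's own type is collated only if it is a default string type,
    # but the generation expression is rewritten regardless of the output type.
    new_type = StringType(collation) if is_default_string(col.data_type) else col.data_type
    new_expr = None if col.generation_expr is None else repoint(col.generation_expr, collation)
    return ColumnDef(col.name, new_type, new_expr)


# c INT GENERATED ALWAYS AS (LENGTH(s)): the output type is not a string,
# yet the reference to s inside the expression still picks up the collation.
c = ColumnDef("c", IntType(), Call("length", (Attr("s", StringType()),)))
fixed = apply_default_collation(c, "UTF8_LCASE")
```

In this model the broadened guard corresponds to `apply_default_collation` rewriting `generation_expr` even when `data_type` is `IntType`.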
### Why are the changes needed?
This is a regression from SPARK-55347, which moved generated-column
validation from the standalone `GeneratedColumnAnalyzer` (previously run at
planning time against the final, collation-applied schema) to inline analysis
plus `CheckAnalysis` / `GeneratedColumnExpression.validate`.
Generated columns are not allowed to involve non-UTF8-binary collated string
types.
**Before SPARK-55347 (worked):**
1. The generation expression was kept as a SQL string, not resolved during
analysis.
2. `ApplyDefaultCollation` applied the table's `DEFAULT COLLATION` to the
`ColumnDefinition`s.
3. Later, at planning time, the expression was parsed and resolved against
the **final** (collated) columns via `GeneratedColumnAnalyzer`.
4. The resolved `AttributeReference`s carried the real collation, so the
non-UTF8-binary collation check correctly rejected the table.
**After SPARK-55347 (bug):**
1. The generation expression is resolved early during analysis, so its
`AttributeReference`s capture the column types as `UTF8_BINARY`.
2. `ApplyDefaultCollation` applies the `DEFAULT COLLATION` to the
`ColumnDefinition`s, but not to those already-resolved `AttributeReference`s
inside the `GeneratedColumnExpression`.
3. `CheckAnalysis` runs `GeneratedColumnExpression.validate`, which walks
the expression tree; the `AttributeReference`s still show `UTF8_BINARY`.
4. The check passes and the table is wrongly created.
In short, the ordering flipped: before, columns were collated *then* the
expression was resolved; after, the expression is resolved *then* the columns
are collated, leaving the references stale.
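The ordering flip can be illustrated with a minimal model. The names below (`resolve`, `apply_default_collation`, `validate`, `old_order`, `new_order`) are hypothetical and only mimic the sequence of steps listed above; they are not Spark's analyzer APIs.

```python
UTF8_BINARY = "UTF8_BINARY"


def resolve(ref_names, columns):
    """Resolve names to (name, collation) pairs, snapshotting the collation
    each column has at the moment of resolution."""
    return [(name, columns[name]) for name in ref_names]


def apply_default_collation(columns, default):
    """Return new column definitions carrying the table's default collation
    (every column in this toy is a default string column)."""
    return {name: default for name in columns}


def validate(resolved_refs):
    """Accept the generated column only if every reference is UTF8_BINARY."""
    return all(collation == UTF8_BINARY for _, collation in resolved_refs)


def old_order(columns, ref_names, default):
    # Before SPARK-55347: collate the columns, then resolve the expression.
    collated = apply_default_collation(columns, default)
    return validate(resolve(ref_names, collated))


def new_order(columns, ref_names, default):
    # After SPARK-55347 (without the fix): resolve first, then collate.
    refs = resolve(ref_names, columns)
    apply_default_collation(columns, default)  # does not touch the stale refs
    return validate(refs)


cols = {"c1": UTF8_BINARY}
rejected_before = not old_order(cols, ["c1"], "UTF8_LCASE")  # True: table rejected
accepted_after = new_order(cols, ["c1"], "UTF8_LCASE")       # True: wrongly accepted
```

With `DEFAULT COLLATION UTF8_BINARY` both orderings accept the table, which matches the allowed case.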
Example that wrongly succeeded:
```sql
CREATE TABLE t (
c1 STRING,
c2 STRING GENERATED ALWAYS AS (SUBSTRING(c1, 0, 1))
) USING <v2source> DEFAULT COLLATION UTF8_LCASE;
```
The identical schema with `c1 STRING COLLATE UTF8_LCASE` is correctly
rejected.
### Does this PR introduce _any_ user-facing change?
Yes. `CREATE`/`REPLACE TABLE` with a generated column whose expression
involves a non-UTF8-binary collated column (via `DEFAULT COLLATION`) now
correctly fails with `UNSUPPORTED_EXPRESSION_GENERATED_COLUMN`, consistent with
the behavior for an explicit per-column `COLLATE`. The fix only affects
unreleased code: master and any branch that carries SPARK-55347.
### How was this patch tested?
New tests in `CollationSuite` covering:
- a generated column rejected via table-level `DEFAULT COLLATION`,
- a non-string-output generated column (`INT GENERATED ALWAYS AS
(LENGTH(c1))`) rejected for both explicit and default collation,
- a generated column allowed under `DEFAULT COLLATION UTF8_BINARY`.
Existing `CollationSuite`,
`DefaultCollation{String,CharVarchar}TestSuite{V1,V2}`,
`DataSourceV2DataFrameSuite`, and `DataSourceV2SQLSuite` generated-column /
generation-expression tests all pass.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor