szehon-ho opened a new pull request, #56465:
URL: https://github.com/apache/spark/pull/56465

   ### What changes were proposed in this pull request?
   `ApplyDefaultCollation` now re-points default string/char/varchar 
`AttributeReference`s inside a column's generation expression to the table's 
default collation, in the same place it assigns the default collation to the 
`ColumnDefinition` itself. The `ColumnDefinition` case guard is broadened so 
this also runs when the generated column's own output type is not a default 
string type (e.g. `c INT GENERATED ALWAYS AS (LENGTH(s))`).
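
To make the broadened guard concrete, here is a minimal toy model in Python. It is not Spark's Catalyst code: `Ref`, `ColumnDef`, `is_default_string`, and `apply_default_collation` are hypothetical names, and a "default string type" is approximated as a string/char/varchar type with no explicit collation (`None`).

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

STRING_TYPES = {"STRING", "CHAR", "VARCHAR"}

@dataclass(frozen=True)
class Ref:
    """Toy stand-in for a resolved attribute reference (hypothetical)."""
    name: str
    type_name: str
    collation: Optional[str] = None  # None models "default string type"

@dataclass(frozen=True)
class ColumnDef:
    """Toy stand-in for a column definition (hypothetical)."""
    name: str
    type_name: str
    collation: Optional[str] = None
    generation_refs: Tuple[Ref, ...] = ()

def is_default_string(type_name: str, collation: Optional[str]) -> bool:
    return type_name in STRING_TYPES and collation is None

def apply_default_collation(col: ColumnDef, default: str) -> ColumnDef:
    # Re-point default string references inside the generation expression,
    # regardless of the column's own output type.
    new_refs = tuple(
        replace(r, collation=default)
        if is_default_string(r.type_name, r.collation) else r
        for r in col.generation_refs
    )
    # Assign the default to the column itself only when it is a default string.
    new_collation = (
        default if is_default_string(col.type_name, col.collation)
        else col.collation
    )
    return replace(col, collation=new_collation, generation_refs=new_refs)

# Models c INT GENERATED ALWAYS AS (LENGTH(s)): the output type is INT,
# but the reference to s must still pick up the table default.
s = Ref("s", "STRING")
explicit = Ref("e", "STRING", "UNICODE")
c = ColumnDef("c", "INT", generation_refs=(s, explicit))
fixed = apply_default_collation(c, "UTF8_LCASE")
```

In this sketch the `INT` column keeps no collation of its own, the reference to `s` now reports `UTF8_LCASE`, and the explicitly collated reference is left untouched.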
   
   ### Why are the changes needed?
   This is a regression from SPARK-55347, which moved generated-column 
validation from the standalone `GeneratedColumnAnalyzer` (previously run at 
planning time against the final, collation-applied schema) to inline analysis 
plus `CheckAnalysis` / `GeneratedColumnExpression.validate`.
   
   A generated column's expression must not involve string types with a 
non-`UTF8_BINARY` collation.
   
   **Before SPARK-55347 (worked):**
   1. The generation expression was kept as a SQL string, not resolved during 
analysis.
   2. `ApplyDefaultCollation` applied the table's `DEFAULT COLLATION` to the 
`ColumnDefinition`s.
   3. Later, at planning time, the expression was parsed and resolved against 
the **final** (collated) columns via `GeneratedColumnAnalyzer`.
   4. The resolved `AttributeReference`s carried the real collation, so the 
non-`UTF8_BINARY` collation check correctly rejected the table.
   
   **After SPARK-55347 (bug):**
   1. The generation expression is resolved early during analysis, so its 
`AttributeReference`s capture the column types as `UTF8_BINARY`.
   2. `ApplyDefaultCollation` applies the `DEFAULT COLLATION` to the 
`ColumnDefinition`s, but not to those already-resolved `AttributeReference`s 
inside the `GeneratedColumnExpression`.
   3. `CheckAnalysis` runs `GeneratedColumnExpression.validate`, which walks 
the expression tree; the `AttributeReference`s still show `UTF8_BINARY`.
   4. The check passes and the table is wrongly created.
   
   In short, the ordering flipped: before, columns were collated *then* the 
expression was resolved; after, the expression is resolved *then* the columns 
are collated, leaving the references stale.
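
The ordering problem can be sketched with a small toy model in Python. This is an illustration only, not Spark code: `Column`, `Ref`, `resolve`, `apply_default_collation`, and `validate` are hypothetical names, and a resolved reference is modeled as a snapshot of the column taken at resolution time.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Column:
    name: str
    collation: str = "UTF8_BINARY"

@dataclass(frozen=True)
class Ref:
    """Resolved reference: a snapshot of the column at resolution time."""
    name: str
    collation: str

def resolve(col: Column) -> Ref:
    return Ref(col.name, col.collation)

def apply_default_collation(col: Column, default: str) -> Column:
    return replace(col, collation=default)

def validate(refs) -> bool:
    """True when the generation expression only sees UTF8_BINARY references."""
    return all(r.collation == "UTF8_BINARY" for r in refs)

c1 = Column("c1")

# Old order: collate the column, then resolve the expression against it.
collated = apply_default_collation(c1, "UTF8_LCASE")
old_ok = validate([resolve(collated)])  # reference sees UTF8_LCASE

# New order: resolve first, then collate; the snapshot goes stale.
early = resolve(c1)
collated = apply_default_collation(c1, "UTF8_LCASE")
new_ok = validate([early])  # reference still sees UTF8_BINARY
```

Under these assumptions `old_ok` is `False` (the table is rejected) while `new_ok` is `True` (the check passes on a stale reference), which mirrors the two numbered sequences above.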
   
   Example that wrongly succeeded:
   ```sql
   CREATE TABLE t (
     c1 STRING,
     c2 STRING GENERATED ALWAYS AS (SUBSTRING(c1, 0, 1))
   ) USING <v2source> DEFAULT COLLATION UTF8_LCASE;
   ```
   The identical schema with `c1 STRING COLLATE UTF8_LCASE` is correctly 
rejected.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. `CREATE`/`REPLACE TABLE` with a generated column whose expression 
involves a non-UTF8-binary collated column (via `DEFAULT COLLATION`) now 
correctly fails with `UNSUPPORTED_EXPRESSION_GENERATED_COLUMN`, consistent with 
the behavior for an explicit per-column `COLLATE`. The fix only affects 
unreleased code: master and any branch that carries SPARK-55347.
   
   ### How was this patch tested?
   New tests in `CollationSuite` covering:
   - a generated column rejected via table-level `DEFAULT COLLATION`,
   - a non-string-output generated column (`INT GENERATED ALWAYS AS 
(LENGTH(c1))`) rejected for both explicit and default collation,
   - a generated column allowed under `DEFAULT COLLATION UTF8_BINARY`.
   
   Existing `CollationSuite`, 
`DefaultCollation{String,CharVarchar}TestSuite{V1,V2}`, 
`DataSourceV2DataFrameSuite`, and `DataSourceV2SQLSuite` generated-column / 
generation-expression tests all pass.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Generated-by: Cursor


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

