szehon-ho opened a new pull request, #56712:
URL: https://github.com/apache/spark/pull/56712

   ### What changes were proposed in this pull request?
   
   Add support for auto-filling generated column values and enforcing generated column constraints during V2 table writes (INSERT).
   
   When a catalog declares `SUPPORT_GENERATED_COLUMN_ON_WRITE`:
   - Missing generated columns are auto-filled using the generation expression via a `Project` node added during analysis
   - User-provided generated column values are validated against the generation expression via `CheckInvariant(EqualNullSafe(col, genExpr))`
   - MERGE, UPDATE, and streaming writes with generated columns are blocked until support is implemented
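The fill-and-validate behavior described above can be sketched outside Spark. This is only an illustrative model, not the PR's implementation: rows are plain dicts, generation expressions are Python callables, and the names `null_safe_equal` and `apply_generated_columns` are hypothetical. It shows why a null-safe comparison (the semantics of `EqualNullSafe`, SQL `<=>`) is a natural fit: a user-supplied NULL matches a generation expression that also evaluates to NULL.

```python
from typing import Any, Callable, Dict

# Hypothetical model: a generation expression is a function of the input row.
GenExpr = Callable[[Dict[str, Any]], Any]


def null_safe_equal(a: Any, b: Any) -> bool:
    """Mimic SQL `<=>`: NULL <=> NULL is true, NULL <=> x is false."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def apply_generated_columns(
    row: Dict[str, Any], gen_exprs: Dict[str, GenExpr]
) -> Dict[str, Any]:
    """Fill omitted generated columns and validate user-supplied ones."""
    out = dict(row)
    for name, expr in gen_exprs.items():
        expected = expr(row)
        if name not in row:
            # Omitted by the writer: auto-fill from the generation expression.
            out[name] = expected
        elif not null_safe_equal(row[name], expected):
            # Supplied by the writer but inconsistent with the expression.
            raise ValueError(
                f"CHECK_CONSTRAINT_VIOLATION: {name}={row[name]!r}, "
                f"expected {expected!r}"
            )
    return out


# Assumed example schema: b is GENERATED ALWAYS AS (a * 2).
gen = {"b": lambda r: None if r["a"] is None else r["a"] * 2}
filled = apply_generated_columns({"a": 3}, gen)
```

In this model, only columns the writer actually supplied are checked, which mirrors the PR's note that the constraint is scoped to user-provided columns.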
   
   **Key changes:**
   - `TableCatalogCapability`: New `SUPPORT_GENERATED_COLUMN_ON_WRITE` capability
   - `CatalogV2Util`: Encode generation expression in V2-to-StructField metadata roundtrip
   - `TableOutputResolver`: Fill missing generated columns in by-name and by-position write paths (checking generation expression before null defaults)
   - `ResolveTableConstraints`: Add generated column constraints using V2 expression with SQL parser fallback, scoped to user-provided columns only
   - `Analyzer` (`ResolveOutputRelation`): Gate on capability; strip generation expression metadata from auto-filled columns so constraints are scoped correctly
   - `RewriteRowLevelCommand`: Block MERGE/UPDATE with generated columns
   - `ResolveWriteToStream`: Block streaming writes with generated columns
   - `SQLConf`: New `spark.sql.generatedColumn.allowNullableIngest.enabled` config
   
   ### Why are the changes needed?
   
   Generated columns (defined with `GENERATED ALWAYS AS (expr)`) are currently supported only in DDL: the generation expression is stored but never evaluated during writes. As a result, connectors must compute generated column values themselves (e.g., Delta Lake does this in its own execution layer).
   
   This PR adds Spark-native support so that any V2 catalog can opt in to having Spark auto-compute generated column values during INSERT. This brings generated columns closer to parity with default column values, which Spark already handles during writes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. For catalogs that declare `SUPPORT_GENERATED_COLUMN_ON_WRITE`:
   - Users can omit generated columns from INSERT statements and the values are auto-computed
   - Users who explicitly provide generated column values will get a `CHECK_CONSTRAINT_VIOLATION` error if the value doesn't match the generation expression
   - MERGE, UPDATE, and streaming writes to tables with generated columns will fail with `UNSUPPORTED_FEATURE.TABLE_OPERATION`
   - New config `spark.sql.generatedColumn.allowNullableIngest.enabled` (default `true`) controls whether nullable non-generated columns can be omitted when writing to tables with generated columns
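The operation gating listed above can be pictured as a small decision function. This is a rough sketch under assumptions, not Spark code: `WriteOp` and `check_write_allowed` are hypothetical names, and it assumes the block applies only when the catalog declares the capability and the target table has at least one generated column.

```python
from enum import Enum


class WriteOp(Enum):
    INSERT = "INSERT"
    MERGE = "MERGE"
    UPDATE = "UPDATE"
    STREAMING = "STREAMING"


# Operations the PR description says are blocked until support is implemented.
BLOCKED_OPS = {WriteOp.MERGE, WriteOp.UPDATE, WriteOp.STREAMING}


def check_write_allowed(
    op: WriteOp, has_generated_columns: bool, catalog_supports_on_write: bool
) -> bool:
    """Return True if Spark-side generated column handling applies to this write.

    Raises for operations that are blocked on tables with generated columns.
    Assumption: without the capability, Spark leaves the write to the connector.
    """
    if not (catalog_supports_on_write and has_generated_columns):
        return False  # no Spark-side handling; behavior is unchanged
    if op in BLOCKED_OPS:
        raise NotImplementedError(
            f"UNSUPPORTED_FEATURE.TABLE_OPERATION: {op.value} "
            "on a table with generated columns"
        )
    return True  # INSERT: auto-fill and validate
```

Under these assumptions, a catalog that does not opt in sees no change, which matches the description's scoping of all user-facing changes to catalogs declaring `SUPPORT_GENERATED_COLUMN_ON_WRITE`.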
   
   ### How was this patch tested?
   
   New `GeneratedColumnWriteSuite` with 30 tests covering:
   - Auto-fill by name and by position
   - Constraint validation (matching/non-matching/NULL values)
   - Multiple generated columns, multi-column expressions, complex expressions (CAST)
   - Type coercion, column reordering, case-insensitive matching
   - INSERT OVERWRITE, INSERT SELECT, CTAS, DataFrame API
   - Generated column in middle of schema, non-trailing by position (error)
   - Generated partition columns
   - Config for nullable ingest
   - Combined with table CHECK constraints
   - Plan inspection (CheckInvariant present for user-provided, absent for auto-filled)
   - Streaming write blocking
   - Mix of auto-filled and user-provided generated columns
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-opus-4-6)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

