[
https://issues.apache.org/jira/browse/SPARK-55716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-55716:
-----------------------------------
Labels: pull-request-available (was: )
> V1 file-based DataSource writes silently accept null values into NOT NULL
> columns
> ---------------------------------------------------------------------------------
>
> Key: SPARK-55716
> URL: https://issues.apache.org/jira/browse/SPARK-55716
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Kent Yao
> Priority: Major
> Labels: pull-request-available
>
> V1 file-based DataSource writes (parquet/orc/json) silently accept null
> values into NOT NULL columns. The root cause has two parts:
> 1. `DataSource.resolveRelation()` calls `dataSchema.asNullable` at line 439,
> which strips NOT NULL constraints recursively. This was added in SPARK-13738
> (2016) for read safety — files may contain nulls regardless of schema.
> However, this also affects the write path.
> 2. `CreateDataSourceTableCommand` stores `dataSource.schema`
> (post-asNullable) in the catalog at line 111, permanently losing NOT NULL
> information.
> As a result, `PreprocessTableInsertion` never injects `AssertNotNull` for V1
> file source tables because the schema it sees is all-nullable.
> Note that `InsertableRelation` (e.g., `SimpleInsertSource`) does NOT have
> this problem because it preserves the original schema (SPARK-24583).
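
To make the root cause concrete, here is a minimal Python sketch of what a recursive "as nullable" pass does to a declared schema. It is not Spark's code: the dict layout only mirrors the shape of Spark's JSON schema representation, and `as_nullable` and `has_not_null` are hypothetical helper names chosen for illustration.

```python
def as_nullable(dt):
    """Return a copy of dt with every nullability flag set to True."""
    if not isinstance(dt, dict):  # atomic type name such as "integer"
        return dt
    kind = dt["type"]
    if kind == "struct":
        return {
            "type": "struct",
            "fields": [
                {**f, "type": as_nullable(f["type"]), "nullable": True}
                for f in dt["fields"]
            ],
        }
    if kind == "array":
        return {
            "type": "array",
            "elementType": as_nullable(dt["elementType"]),
            "containsNull": True,
        }
    if kind == "map":
        return {
            "type": "map",
            "keyType": as_nullable(dt["keyType"]),
            "valueType": as_nullable(dt["valueType"]),
            "valueContainsNull": True,
        }
    return dt


def has_not_null(dt):
    """True if any position in dt is still declared NOT NULL."""
    if not isinstance(dt, dict):
        return False
    kind = dt["type"]
    if kind == "struct":
        return any(not f["nullable"] or has_not_null(f["type"]) for f in dt["fields"])
    if kind == "array":
        return not dt["containsNull"] or has_not_null(dt["elementType"])
    if kind == "map":
        return (
            not dt["valueContainsNull"]
            or has_not_null(dt["keyType"])
            or has_not_null(dt["valueType"])
        )
    return False


# Illustrative table: id INT NOT NULL, tags ARRAY<STRING NOT NULL>
declared = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": False},
            "nullable": True,
            "metadata": {},
        },
    ],
}

resolved = as_nullable(declared)
print(has_not_null(declared))  # the declared schema carries NOT NULL flags
print(has_not_null(resolved))  # the all-nullable copy carries none
```

Once only the all-nullable copy is stored, a later pass has nothing left to tell it where a null check would be needed, which matches the behavior described above.
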
> **Fix:**
> - Fix `CreateDataSourceTableCommand` to preserve user-specified nullability
> using recursive nullability merging (the resolved `dataSource.schema` may
> have CharVarchar normalization and metadata that must be kept).
> - Fix `PreprocessTableInsertion` to restore nullability flags from the
> catalog schema before null checks.
> - Add a legacy config `spark.sql.legacy.allowNullInsertForFileSourceTables`
> (default false) to gate the write-side enforcement for backward compatibility.
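
The "recursive nullability merging" idea in the first bullet can be sketched as follows. This is an assumption-laden illustration in Python, not the patch itself: `merge_nullability` is a hypothetical name, schemas are plain dicts shaped like Spark's JSON schema representation, fields are matched by exact name (the real resolver may differ), and the metadata key in the example is shown only to illustrate that resolved metadata survives the merge.

```python
def merge_nullability(resolved, declared):
    """Keep types and metadata from `resolved`, take nullability from `declared`."""
    if not (isinstance(resolved, dict) and isinstance(declared, dict)):
        return resolved  # atomic types: the resolved (normalized) type wins
    if resolved["type"] != declared["type"]:
        return resolved  # shapes differ: leave the resolved type untouched
    kind = resolved["type"]
    if kind == "struct":
        declared_by_name = {f["name"]: f for f in declared["fields"]}
        fields = []
        for f in resolved["fields"]:
            d = declared_by_name.get(f["name"])
            if d is None:
                fields.append(f)  # no declared counterpart: keep as resolved
            else:
                fields.append({
                    **f,  # keeps resolved metadata
                    "type": merge_nullability(f["type"], d["type"]),
                    "nullable": d["nullable"],
                })
        return {**resolved, "fields": fields}
    if kind == "array":
        return {
            **resolved,
            "elementType": merge_nullability(resolved["elementType"], declared["elementType"]),
            "containsNull": declared["containsNull"],
        }
    if kind == "map":
        return {
            **resolved,
            "keyType": merge_nullability(resolved["keyType"], declared["keyType"]),
            "valueType": merge_nullability(resolved["valueType"], declared["valueType"]),
            "valueContainsNull": declared["valueContainsNull"],
        }
    return resolved


# Illustrative input: name VARCHAR(10) NOT NULL, tags ARRAY<STRING NOT NULL>
declared = {
    "type": "struct",
    "fields": [
        {"name": "name", "type": "varchar(10)", "nullable": False, "metadata": {}},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": False},
            "nullable": True,
            "metadata": {},
        },
    ],
}

# The resolved schema: all-nullable, varchar normalized to string plus metadata.
resolved = {
    "type": "struct",
    "fields": [
        {
            "name": "name",
            "type": "string",
            "nullable": True,
            "metadata": {"__CHAR_VARCHAR_TYPE_STRING": "varchar(10)"},
        },
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": True},
            "nullable": True,
            "metadata": {},
        },
    ],
}

merged = merge_nullability(resolved, declared)
```

The design point is that neither schema can simply replace the other: the declared one has the right nullability but lacks the normalization, and the resolved one is the reverse.
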
> **Scope:**
> - This fix covers catalog-based table writes (INSERT INTO, INSERT OVERWRITE).
> - DataFrame `df.write.format().save()` without a catalog table is NOT
> affected by this change (there is no catalog schema to reference).
> - Both top-level and nested type nullability (array elements, struct fields,
> map values) are enforced.
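
As a rough illustration of what enforcing both top-level and nested nullability means for a single row, the sketch below walks a value alongside its schema and reports every null found in a position declared NOT NULL. It is plain Python under stated assumptions, not Spark's `AssertNotNull`: `null_violations` and `check_row` are hypothetical names, rows are modeled as dicts and lists, and the schema uses the same dict shape as Spark's JSON schema representation.

```python
def null_violations(value, dt, nullable, path):
    """Return the paths where a null sits in a position declared NOT NULL."""
    if value is None:
        return [] if nullable else [path]
    if not isinstance(dt, dict):  # atomic type: nothing nested to inspect
        return []
    kind = dt["type"]
    found = []
    if kind == "struct":
        for f in dt["fields"]:
            found += null_violations(
                value.get(f["name"]), f["type"], f["nullable"], f"{path}.{f['name']}"
            )
    elif kind == "array":
        for i, item in enumerate(value):
            found += null_violations(
                item, dt["elementType"], dt["containsNull"], f"{path}[{i}]"
            )
    elif kind == "map":
        for k, v in value.items():
            found += null_violations(
                v, dt["valueType"], dt["valueContainsNull"], f"{path}[{k!r}]"
            )
    return found


def check_row(row, schema):
    """Check one row (a dict) against a struct schema."""
    return null_violations(row, schema, False, "row")


# Illustrative schema: id INT NOT NULL, tags ARRAY<STRING NOT NULL>,
# props MAP<STRING, INT NOT NULL>
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string", "containsNull": False},
            "nullable": True,
            "metadata": {},
        },
        {
            "name": "props",
            "type": {
                "type": "map",
                "keyType": "string",
                "valueType": "integer",
                "valueContainsNull": False,
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

bad_row = {"id": None, "tags": ["a", None], "props": {"k": None}}
good_row = {"id": 1, "tags": None, "props": {}}

print(check_row(bad_row, schema))   # ['row.id', 'row.tags[1]', "row.props['k']"]
print(check_row(good_row, schema))  # []
```

Note that `good_row` passes even though `tags` is null: the column itself is nullable, and only its elements are declared NOT NULL.
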