Kent Yao created SPARK-55716:
--------------------------------

             Summary: V1 file-based DataSource writes silently accept null 
values into NOT NULL columns
                 Key: SPARK-55716
                 URL: https://issues.apache.org/jira/browse/SPARK-55716
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Kent Yao


V1 file-based DataSource writes (parquet/orc/json) silently accept null values 
into NOT NULL columns. The root cause has two parts:

1. `DataSource.resolveRelation()` calls `dataSchema.asNullable` at line 439, 
which strips NOT NULL constraints recursively. This was added in SPARK-13738 
(2016) for read safety, since files may contain nulls regardless of schema. 
However, the same relaxed schema is also used on the write path.

2. `CreateDataSourceTableCommand` stores `dataSource.schema` (post-asNullable) 
in the catalog at line 111, permanently losing NOT NULL information.

As a result, `PreprocessTableInsertion` never injects `AssertNotNull` for V1 
file source tables because the schema it sees is all-nullable.
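
To illustrate what "strips NOT NULL constraints recursively" means, here is a 
minimal sketch in plain Python. The classes (`Array`, `Map`, `Field`, `Struct`) 
and the `as_nullable` function are toy stand-ins invented for this example; 
they are not the Spark or PySpark API, and Spark's real implementation lives 
in Scala. Atomic types are represented as plain strings.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy schema model (hypothetical, for illustration only).
@dataclass(frozen=True)
class Array:
    element: object
    contains_null: bool

@dataclass(frozen=True)
class Map:
    key: object
    value: object
    value_contains_null: bool

@dataclass(frozen=True)
class Field:
    name: str
    dtype: object
    nullable: bool

@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]

def as_nullable(dtype):
    """Return a copy of `dtype` with every nullability flag forced to True."""
    if isinstance(dtype, Array):
        return Array(as_nullable(dtype.element), True)
    if isinstance(dtype, Map):
        return Map(as_nullable(dtype.key), as_nullable(dtype.value), True)
    if isinstance(dtype, Struct):
        return Struct(tuple(
            Field(f.name, as_nullable(f.dtype), True) for f in dtype.fields
        ))
    return dtype  # atomic types carry no nullability flag of their own

# Schema as the user declared it: both columns and the array elements NOT NULL.
declared = Struct((
    Field("id", "int", nullable=False),
    Field("tags", Array("string", contains_null=False), nullable=False),
))

# Schema after relaxation: every NOT NULL marker is gone, at every depth.
relaxed = as_nullable(declared)
```

If the relaxed schema is what ends up in the catalog, nothing downstream can 
tell that `id` or the elements of `tags` were ever declared NOT NULL.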

Note that `InsertableRelation` (e.g., `SimpleInsertSource`) does NOT have this 
problem because it preserves the original schema (SPARK-24583).

**Fix:**
- Fix `CreateDataSourceTableCommand` to preserve user-specified nullability 
using recursive nullability merging (the resolved `dataSource.schema` may have 
CharVarchar normalization and metadata that must be kept).
- Fix `PreprocessTableInsertion` to restore nullability flags from the catalog 
schema before null checks.
- Add a legacy config `spark.sql.legacy.allowNullInsertForFileSourceTables` 
(default false) that gates the write-side enforcement; setting it to true 
keeps the old behavior for backward compatibility.
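
The "recursive nullability merging" in the first bullet can be sketched 
roughly as follows, again in plain Python with toy classes rather than 
Spark's own types. The function name `merge_nullability`, the `raw_type` 
metadata key, and matching struct fields by exact name are all assumptions 
made for this example; the actual patch may differ on each point.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy schema model (hypothetical, for illustration only).
@dataclass(frozen=True)
class Array:
    element: object
    contains_null: bool = True

@dataclass(frozen=True)
class Map:
    key: object
    value: object
    value_contains_null: bool = True

@dataclass(frozen=True)
class Field:
    name: str
    dtype: object
    nullable: bool = True
    metadata: Tuple[Tuple[str, str], ...] = ()

@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]

def merge_nullability(resolved, declared):
    """Keep types and metadata from `resolved`, nullability from `declared`."""
    if isinstance(resolved, Array) and isinstance(declared, Array):
        element = merge_nullability(resolved.element, declared.element)
        return Array(element, declared.contains_null)
    if isinstance(resolved, Map) and isinstance(declared, Map):
        return Map(
            merge_nullability(resolved.key, declared.key),
            merge_nullability(resolved.value, declared.value),
            declared.value_contains_null,
        )
    if isinstance(resolved, Struct) and isinstance(declared, Struct):
        by_name = {f.name: f for f in declared.fields}
        merged = []
        for f in resolved.fields:
            d = by_name.get(f.name)
            if d is None:
                merged.append(f)  # no declared counterpart: keep as resolved
            else:
                dtype = merge_nullability(f.dtype, d.dtype)
                merged.append(Field(f.name, dtype, d.nullable, f.metadata))
        return Struct(tuple(merged))
    return resolved  # atomic or mismatched shapes: keep the resolved type

# What the user wrote: name VARCHAR(10) NOT NULL, tags ARRAY<STRING NOT NULL>.
declared = Struct((
    Field("name", "varchar(10)", nullable=False),
    Field("tags", Array("string", contains_null=False), nullable=True),
))

# What the resolved schema might look like: normalized type, extra metadata,
# everything nullable.
resolved = Struct((
    Field("name", "string", metadata=(("raw_type", "varchar(10)"),)),
    Field("tags", Array("string")),
))

merged = merge_nullability(resolved, declared)
```

The merged schema keeps the normalized `string` type and its metadata from 
the resolved side while taking the NOT NULL flags from the declaration, which 
is why a plain swap of one schema for the other would not be enough.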

**Scope:**
- This fix covers catalog-based table writes (INSERT INTO, INSERT OVERWRITE).
- DataFrame `df.write.format().save()` without a catalog table is NOT covered 
by this fix (there is no catalog schema to reference).
- Both top-level and nested type nullability (array elements, struct fields, 
map values) are enforced.
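
As a rough picture of what enforcing nested nullability involves, the sketch 
below walks a value against a schema and reports every place where a null 
sits in a non-nullable position. It is a plain-Python model with hypothetical 
names (`null_violations` and the toy schema classes). Spark's `AssertNotNull` 
is an expression evaluated at write time, whereas this sketch simply collects 
the offending paths.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy schema model (hypothetical, for illustration only).
@dataclass(frozen=True)
class Array:
    element: object
    contains_null: bool

@dataclass(frozen=True)
class Map:
    key: object
    value: object
    value_contains_null: bool

@dataclass(frozen=True)
class Field:
    name: str
    dtype: object
    nullable: bool

@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]

def null_violations(value, dtype, path="row"):
    """Return paths where `value` holds None but the schema forbids it.

    `value` itself is assumed to be non-None; structs are modeled as dicts.
    """
    found = []

    def check(child, child_type, allowed, child_path):
        if child is None:
            if not allowed:
                found.append(child_path)
        else:
            found.extend(null_violations(child, child_type, child_path))

    if isinstance(dtype, Struct):
        for f in dtype.fields:
            check(value.get(f.name), f.dtype, f.nullable, f"{path}.{f.name}")
    elif isinstance(dtype, Array):
        for i, item in enumerate(value):
            check(item, dtype.element, dtype.contains_null, f"{path}[{i}]")
    elif isinstance(dtype, Map):
        for k, v in value.items():
            check(v, dtype.value, dtype.value_contains_null, f"{path}[{k!r}]")
    return found

schema = Struct((
    Field("id", "int", nullable=False),
    Field("tags", Array("string", contains_null=False), nullable=True),
    Field("attrs", Map("string", "string", value_contains_null=False),
          nullable=True),
))

bad_row = {"id": None, "tags": ["a", None], "attrs": {"k": None}}
violations = null_violations(bad_row, schema)

good_row = {"id": 1, "tags": None, "attrs": {}}
no_violations = null_violations(good_row, schema)
```

Here `bad_row` breaks the constraint at the top level, inside the array, and 
inside the map, while `good_row` passes because `tags` itself is nullable 
even though its elements are not.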


