[PR] fix: Add schema validation for native_datafusion Parquet scan [datafusion-comet]

via GitHub Sun, 22 Mar 2026 01:29:02 -0700


vaibhawvipul opened a new pull request, #3759:
URL: https://github.com/apache/datafusion-comet/pull/3759


   When `spark.comet.scan.impl=native_datafusion`, DataFusion's Parquet reader 
silently coerces incompatible types instead of raising an error as Spark does.
   
   ## Which issue does this PR close?
   
   Closes #3720.
   
   ## Rationale for this change
   
   DataFusion is more permissive than Spark when reading Parquet files with 
mismatched schemas. For example, reading an INT32 column as bigint, or 
TimestampLTZ as TimestampNTZ, silently succeeds in DataFusion but should throw 
SchemaColumnConvertNotSupportedException per Spark's behavior. This breaks 
correctness guarantees that Spark users rely on.
   
   ## What changes are included in this PR?
   
   Adds schema compatibility validation in `SparkPhysicalExprAdapterFactory` 
(`schema_adapter.rs`) that mirrors Spark's `TypeUtil.checkParquetType()` rules:
   - `validate_spark_schema_compatibility()` checks each logical field against 
its physical counterpart when a file is opened
   - `is_spark_compatible_read()` defines the allowlist of valid 
Parquet-to-Spark type conversions (matching TypeUtil's logic)
   - Incompatible reads now produce errors in `"Column: [name], Expected: 
<type>, Found: <type>"` format
   - Continues to allow INT96→LTZ (seen as NTZ→LTZ, because DataFusion coerces 
INT96 to NTZ) and Timestamp→Int64 (nanosAsLong)
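
The allowlist approach described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Ty`, `Field`, `is_compatible_read`, and `validate_schema` are hypothetical stand-ins that use a toy type enum instead of Arrow's `DataType`, and the allowlist covers only the cases named in this description.

```rust
// Hypothetical, simplified stand-in for Arrow's DataType (assumption: the
// real code works on Arrow types; this enum only models the cases above).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Int64,
    UInt32,
    Float32,
    Float64,
    Binary,
    Utf8,
    // ltz = true models TimestampLTZ, ltz = false models TimestampNTZ.
    Timestamp { ltz: bool },
    Decimal { precision: u8, scale: i8 },
}

// Hypothetical field: a column name plus its type.
struct Field {
    name: String,
    ty: Ty,
}

// Allowlist of (physical -> logical) reads. Anything not listed is rejected,
// including widening such as Int32 -> Int64 or Float32 -> Float64.
fn is_compatible_read(physical: &Ty, logical: &Ty) -> bool {
    match (physical, logical) {
        // Identical types (including identical decimal precision/scale).
        (p, l) if p == l => true,
        (Ty::Binary, Ty::Utf8) => true,
        (Ty::UInt32, Ty::Int64) => true,
        // NTZ -> LTZ is allowed; the reverse (LTZ -> NTZ) is not.
        (Ty::Timestamp { ltz: false }, Ty::Timestamp { ltz: true }) => true,
        // Timestamp read as a long (nanosAsLong).
        (Ty::Timestamp { .. }, Ty::Int64) => true,
        _ => false,
    }
}

// Checks each logical field against the physical field of the same name and
// reports the first mismatch in the "Column / Expected / Found" format.
// Assumption: logical fields absent from the file are skipped here.
fn validate_schema(logical: &[Field], physical: &[Field]) -> Result<(), String> {
    for lf in logical {
        if let Some(pf) = physical.iter().find(|pf| pf.name == lf.name) {
            if !is_compatible_read(&pf.ty, &lf.ty) {
                return Err(format!(
                    "Column: [{}], Expected: {:?}, Found: {:?}",
                    lf.name, lf.ty, pf.ty
                ));
            }
        }
    }
    Ok(())
}
```

In this sketch, validating a logical `id: Int64` field against a physical `id: Int32` field returns an `Err` carrying `Column: [id], Expected: Int64, Found: Int32`. An allowlist rejects any pairing that is not explicitly listed, so a conversion that nobody considered fails loudly instead of being silently coerced.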
   
   ## How are these changes tested?
   
   - `parquet_int_as_long_should_fail` - SPARK-35640: INT32 read as bigint is 
rejected
   - `parquet_timestamp_ltz_as_ntz_should_fail` - SPARK-36182: TimestampLTZ 
read as TimestampNTZ is rejected
   - `parquet_roundtrip_unsigned_int` - UInt32→Int32 (existing test, still 
passes)
   - `test_is_spark_compatible_read` - unit test covering compatible cases 
(Binary→Utf8, UInt32→Int64, NTZ→LTZ, Timestamp→Int64) and incompatible cases 
(Utf8→Timestamp, Int32→Int64, LTZ→NTZ, Utf8→Int32, Float→Double, Decimal 
precision/scale mismatches)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

