vaibhawvipul opened a new pull request, #3759: URL: https://github.com/apache/datafusion-comet/pull/3759
When `spark.comet.scan.impl=native_datafusion`, DataFusion's Parquet reader silently coerces incompatible types instead of erroring like Spark does.

## Which issue does this PR close?

Closes #3720.

## Rationale for this change

DataFusion is more permissive than Spark when reading Parquet files with mismatched schemas. For example, reading an INT32 column as bigint, or TimestampLTZ as TimestampNTZ, silently succeeds in DataFusion but should throw `SchemaColumnConvertNotSupportedException` per Spark's behavior. This breaks correctness guarantees that Spark users rely on.

## What changes are included in this PR?

Adds schema compatibility validation in `SparkPhysicalExprAdapterFactory` (`schema_adapter.rs`) that mirrors Spark's `TypeUtil.checkParquetType()` rules:

- `validate_spark_schema_compatibility()` checks each logical field against its physical counterpart when a file is opened
- `is_spark_compatible_read()` defines the allowlist of valid Parquet-to-Spark type conversions (matching TypeUtil's logic)
- Incompatible reads now produce errors in `"Column: [name], Expected: <type>, Found: <type>"` format
- Correctly allows INT96→LTZ (DataFusion coerces INT96 to NTZ) and Timestamp→Int64 (nanosAsLong)

## How are these changes tested?

- `parquet_int_as_long_should_fail` - SPARK-35640: INT32 read as bigint is rejected
- `parquet_timestamp_ltz_as_ntz_should_fail` - SPARK-36182: TimestampLTZ read as TimestampNTZ is rejected
- `parquet_roundtrip_unsigned_int` - UInt32→Int32 (existing test, still passes)
- `test_is_spark_compatible_read` - unit test covering compatible cases (Binary→Utf8, UInt32→Int64, NTZ→LTZ, Timestamp→Int64) and incompatible cases (Utf8→Timestamp, Int32→Int64, LTZ→NTZ, Utf8→Int32, Float→Double, Decimal precision/scale mismatches)
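
For readers who want a feel for the allowlist approach, here is a minimal, self-contained sketch. It is not the code in `schema_adapter.rs`: the `Ty` enum is a hypothetical stand-in for Arrow's `DataType`, the function signatures are assumed, and only the conversions named in the unit-test description are modelled. Columns missing from the file are simply skipped, which is also an assumption.

```rust
// Hypothetical, simplified stand-in for Arrow's DataType (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Int64,
    UInt32,
    Float32,
    Float64,
    Binary,
    Utf8,
    TimestampLtz,
    TimestampNtz,
    Decimal { precision: u8, scale: i8 },
}

/// Allowlist check: `physical` is the type found in the file,
/// `logical` is the type the query asked for.
fn is_spark_compatible_read(physical: &Ty, logical: &Ty) -> bool {
    use Ty::*;
    match (physical, logical) {
        // Identical types, including decimals with equal precision and scale.
        (p, l) if p == l => true,
        (Binary, Utf8) => true,
        (UInt32, Int64) => true,
        // INT96 surfaces as NTZ after DataFusion's coercion, so NTZ -> LTZ is allowed.
        (TimestampNtz, TimestampLtz) => true,
        // Timestamp -> Int64 (nanosAsLong); allowed for both variants in this sketch.
        (TimestampLtz | TimestampNtz, Int64) => true,
        // Everything else (Int32 -> Int64, LTZ -> NTZ, Float32 -> Float64, ...) is rejected.
        _ => false,
    }
}

/// Checks every requested field against the field of the same name in the file.
fn validate_spark_schema_compatibility(
    logical: &[(&str, Ty)],
    physical: &[(&str, Ty)],
) -> Result<(), String> {
    for (name, want) in logical {
        // Assumption: a column absent from the file is not an error here.
        if let Some((_, found)) = physical.iter().find(|(n, _)| n == name) {
            if !is_spark_compatible_read(found, want) {
                return Err(format!(
                    "Column: [{name}], Expected: {want:?}, Found: {found:?}"
                ));
            }
        }
    }
    Ok(())
}
```

With these definitions, validating a requested `("id", Ty::Int64)` against a file column `("id", Ty::Int32)` returns an `Err` in the `Column: [...], Expected: ..., Found: ...` shape, while Binary→Utf8 passes. An allowlist with a reject-by-default arm means any conversion not explicitly listed fails loudly rather than being silently coerced.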
