andygrove opened a new pull request, #3689:
URL: https://github.com/apache/datafusion-comet/pull/3689

   ## Which issue does this PR close?
   
   Closes #3311.
   
   ## Rationale for this change
   
   When `spark.comet.schemaEvolution.enabled` is set to `false` (the default), 
the `native_datafusion` scan should reject Parquet files whose physical schema 
differs from the expected logical schema (e.g., a column written as int but 
read as long). Previously, `native_datafusion` silently allowed this schema 
widening, which produced incorrect results or confusing errors instead of the 
`SchemaColumnConvertNotSupportedException`-style error that Spark produces.
   
   ## What changes are included in this PR?
   
   **Runtime schema mismatch detection in native code:**
   - Added `detect_schema_mismatch()` function in `schema_adapter.rs` that 
compares logical and physical schemas per-file at runtime
   - Added `is_type_promotion()` recursive function to distinguish real type 
promotions (Int32→Int64) from adapter-handled differences (timestamp tz/unit, 
list/map/struct metadata, unsigned ints, FixedSizeBinary)
   - The `schema_evolution_enabled` config flows from JVM through protobuf to 
`SparkParquetOptions`
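
The detection logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Ty` enum is a hypothetical stand-in for Arrow's `DataType`, the function signatures and the error string are assumptions, and only the timestamp and nested-type cases are modeled (unsigned ints and FixedSizeBinary are left out).

```rust
/// Hypothetical, simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Int64,
    Binary,
    Timestamp { unit: &'static str, tz: Option<&'static str> },
    List(Box<Ty>),
    Struct(Vec<(&'static str, Ty)>),
}

/// Returns true when reading `physical` as `logical` would require a real
/// type promotion, and false when the types are equal or differ only in
/// ways this sketch assumes the schema adapter handles.
fn is_type_promotion(logical: &Ty, physical: &Ty) -> bool {
    match (logical, physical) {
        (l, p) if l == p => false,
        // Timestamp unit/timezone differences are treated as adapter-handled.
        (Ty::Timestamp { .. }, Ty::Timestamp { .. }) => false,
        // Recurse into list elements.
        (Ty::List(l), Ty::List(p)) => is_type_promotion(l, p),
        // Recurse into struct fields, matched by name; fields missing from
        // the file are ignored here.
        (Ty::Struct(ls), Ty::Struct(ps)) => ls.iter().any(|(name, lt)| {
            ps.iter()
                .find(|(pn, _)| pn == name)
                .map_or(false, |(_, pt)| is_type_promotion(lt, pt))
        }),
        // Anything else (Int32 -> Int64, Binary -> Timestamp, ...) is a mismatch.
        _ => true,
    }
}

/// Compares the logical and physical schema of one file and reports the
/// first offending column, unless schema evolution is enabled.
fn detect_schema_mismatch(
    logical: &[(&str, Ty)],
    physical: &[(&str, Ty)],
    schema_evolution_enabled: bool,
) -> Result<(), String> {
    if schema_evolution_enabled {
        return Ok(());
    }
    for (name, lt) in logical {
        if let Some((_, pt)) = physical.iter().find(|(pn, _)| pn == name) {
            if is_type_promotion(lt, pt) {
                return Err(format!(
                    "column `{}`: file has {:?}, expected {:?}",
                    name, pt, lt
                ));
            }
        }
    }
    Ok(())
}
```

Under these assumptions, a file that stores `id` as `Int32` while the query expects `Int64` is rejected when `schema_evolution_enabled` is `false` and accepted when it is `true`, while a pure timestamp unit or timezone difference is never reported.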
   
   **Spark-compatible error conversion:**
   - Added `SchemaColumnConvertNotSupported` variant to `SparkError` enum
   - Errors are emitted as `DataFusionError::External(SparkError)` so they flow 
through the JSON error path
   - Added `SchemaColumnConvertNotSupported` handler in 
`ShimSparkErrorConverter` (all 3 Spark versions) that calls 
`QueryExecutionErrors.unsupportedSchemaColumnConvertError()`, producing the 
same `SparkException` with error class `_LEGACY_ERROR_TEMP_2063` that Spark 
natively produces
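
To illustrate how a typed error can travel through a boxed `External` variant and be recovered on the other side, here is a small std-only sketch. It is not the code from this PR: `EngineError` is a hypothetical stand-in for `DataFusionError` that models only its `External` variant, and the field names and message text on `SchemaColumnConvertNotSupported` are assumptions chosen for illustration.

```rust
use std::error::Error;
use std::fmt;

/// Sketch of a Spark-facing error; the field names are hypothetical.
#[derive(Debug)]
enum SparkError {
    SchemaColumnConvertNotSupported {
        file_path: String,
        column: String,
        physical_type: String,
        logical_type: String,
    },
}

impl fmt::Display for SparkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SparkError::SchemaColumnConvertNotSupported {
                file_path,
                column,
                physical_type,
                logical_type,
            } => write!(
                f,
                // Illustrative wording only, not Spark's actual message.
                "cannot convert column {} in file {}: expected {}, found {}",
                column, file_path, logical_type, physical_type
            ),
        }
    }
}

impl Error for SparkError {}

/// Hypothetical stand-in for `DataFusionError`, with only `External`.
#[derive(Debug)]
enum EngineError {
    External(Box<dyn Error + Send + Sync>),
}

/// Recovers the typed `SparkError` from the boxed external error, if any,
/// so a later stage can map it to the matching Spark exception.
fn as_spark_error(err: &EngineError) -> Option<&SparkError> {
    match err {
        EngineError::External(inner) => inner.downcast_ref::<SparkError>(),
    }
}
```

Keeping the error as a typed value inside `External`, rather than flattening it to a string early, is what lets a later stage pick the matching Spark exception.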
   
   **Spark SQL test updates:**
   - Re-enabled the previously ignored SPARK-35640 tests (`read binary as 
timestamp should throw schema incompatible error`, `int as long should throw 
schema incompatible error`) for the `native_datafusion` scan, since the 
enforcement now produces errors that match Spark's
   
   ## How are these changes tested?
   
   - Rust unit test `parquet_schema_mismatch_rejected_when_evolution_disabled` 
validates that type mismatches are rejected when schema evolution is disabled 
and allowed when enabled
   - Existing `ParquetReadSuite` schema evolution tests validate end-to-end 
behavior
   - Spark SQL tests (SPARK-35640) run in CI with Comet enabled to verify error 
compatibility


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

