andygrove opened a new issue, #4354:
URL: https://github.com/apache/datafusion-comet/issues/4354
## Description
On Spark 3.x, Comet's native-error → JVM-exception shim
(`spark/src/main/spark-3.{4,5}/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala`)
translates a native `ParquetSchemaConvert` error into a `SparkException` whose
cause is a `SchemaColumnConvertNotSupportedException`:
```scala
val cause = new SchemaColumnConvertNotSupportedException(
  column, physicalType, logicalType)
QueryExecutionErrors.unsupportedSchemaColumnConvertError(
  filePath, column, logicalType, physicalType, cause)
// returns: new SparkException(errorClass = "_LEGACY_ERROR_TEMP_2063", ..., cause = e)
```
Spark 3.x's executor / task error handling then re-wraps this `SparkException`
once more on the way back to the driver, producing a two-level chain:
```
SparkException (driver-side wrapping)
  cause -> SparkException (shim-generated, errorClass "_LEGACY_ERROR_TEMP_2063")
    cause -> SchemaColumnConvertNotSupportedException
```
Spark's own vectorized reader produces a one-level chain because
`ParquetVectorUpdaterFactory.getUpdater` throws
`SchemaColumnConvertNotSupportedException` directly; the file-scan code catches
it once and wraps it in a `SparkException`. Spark 4.0+ also produces a
one-level chain for Comet because the 4.x shim's
`parquetColumnDataTypeMismatchError` path appears not to be re-wrapped by the
executor.
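To see why the extra layer matters, here is a self-contained sketch (plain
Scala, no Spark required) of the two chain shapes; `Wrapped` stands in for
`SparkException` and `Typed` for `SchemaColumnConvertNotSupportedException`:
```scala
// Stand-ins for SparkException and SchemaColumnConvertNotSupportedException.
class Typed extends RuntimeException("typed")
class Wrapped(cause: Throwable) extends RuntimeException("wrapped", cause)

// Spark's own reader (and Comet on 4.x): one wrapping layer.
val oneLevel = new Wrapped(new Typed)
// Comet's 3.x shim today: two wrapping layers.
val twoLevel = new Wrapped(new Wrapped(new Typed))

assert(oneLevel.getCause.isInstanceOf[Typed])   // passes
assert(twoLevel.getCause.isInstanceOf[Wrapped]) // getCause is the inner Wrapped
```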
## Why it matters
Spark's own test `SPARK-34212 Parquet should read decimals correctly` (and
similar tests) asserts the cause directly:
```scala
val e = intercept[SparkException] {
  readParquet(schema, path).collect()
}.getCause
assert(e.isInstanceOf[SchemaColumnConvertNotSupportedException])
```
On Comet 3.x, `e.getCause` is the inner `SparkException`, not the
`SchemaColumnConvertNotSupportedException`, so the assertion fails. Tests that
walk the cause chain (e.g. our regression test in `ParquetReadSuite`) pass;
a sketch follows.
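A minimal sketch of such a chain-walking assertion, reusing ScalaTest's
`intercept` and the `readParquet` helper from the snippet above; it passes
whether the typed exception sits one or two `SparkException` layers down:
```scala
// Walk the full cause chain instead of assuming a fixed wrapping depth.
val top = intercept[SparkException] { readParquet(schema, path).collect() }
val chain = Iterator.iterate[Throwable](top)(_.getCause).takeWhile(_ != null)
assert(chain.exists(_.isInstanceOf[SchemaColumnConvertNotSupportedException]))
```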
## Affected tests (currently kept ignored)
- `dev/diffs/3.4.3.diff` — `SPARK-34212 Parquet should read decimals correctly`
  (`ParquetQuerySuite`).
- `dev/diffs/3.5.8.diff` — the same test.

These would be unignored in `4.0.2.diff` / `4.1.1.diff` (where the chain is
one-level and the schema-adapter rejection from #4351 is in place).
## Suggested fix
Change the 3.x shim to throw `SchemaColumnConvertNotSupportedException`
directly rather than wrapping it in `unsupportedSchemaColumnConvertError`'s
`SparkException`. Spark's task error handling will wrap it once on the way
back to the driver, producing the same one-level chain as Spark's own
vectorized reader. The error message format (`Parquet column cannot be
converted in file …`) needs to be preserved, since some Spark SQL tests
assert on it.
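Concretely, the change amounts to something like the following sketch (against
the shim snippet quoted above; whether the shim throws or returns the
exception, and how the message format is kept, are details for the actual
patch):
```scala
// Before (current 3.x shim): the typed exception is wrapped in the
// "_LEGACY_ERROR_TEMP_2063" SparkException, which the executor wraps again.
val cause = new SchemaColumnConvertNotSupportedException(
  column, physicalType, logicalType)
QueryExecutionErrors.unsupportedSchemaColumnConvertError(
  filePath, column, logicalType, physicalType, cause)

// After (suggested): surface the typed exception itself; Spark's task error
// handling adds the single SparkException layer on the way to the driver.
// The "Parquet column cannot be converted in file ..." message must still be
// preserved somehow, since some Spark SQL tests assert on it.
new SchemaColumnConvertNotSupportedException(column, physicalType, logicalType)
```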
## Related
- #4351 — the schema-adapter rejection that surfaces this chain.
- #3720 — parent umbrella.