andygrove opened a new issue, #4354:
URL: https://github.com/apache/datafusion-comet/issues/4354
## Description
On Spark 3.x, Comet's native-error → JVM-exception shim
(`spark/src/main/spark-3.{4,5}/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala`)
translates a native `ParquetSchemaConvert` error into a `SparkException` whose
cause is a `SchemaColumnConvertNotSupportedException`:
```scala
val cause = new SchemaColumnConvertNotSupportedException(
  column, physicalType, logicalType)
QueryExecutionErrors.unsupportedSchemaColumnConvertError(
  filePath, column, logicalType, physicalType, cause)
// returns: new SparkException(errorClass = "_LEGACY_ERROR_TEMP_2063", ..., cause = e)
```
Spark 3.x's executor / task error handling then re-wraps this `SparkException`
once more on the way back to the driver, producing a two-level chain:
```
SparkException (driver-side wrapping)
  cause -> SparkException (shim-generated, errorClass "_LEGACY_ERROR_TEMP_2063")
    cause -> SchemaColumnConvertNotSupportedException
```
Spark's own vectorized reader produces a one-level chain because
`ParquetVectorUpdaterFactory.getUpdater` throws
`SchemaColumnConvertNotSupportedException` directly; the file-scan code catches
it once and wraps it in a `SparkException`. Spark 4.0+ also produces a
one-level chain for Comet because the 4.x shim's
`parquetColumnDataTypeMismatchError` path appears not to be re-wrapped by the
executor.
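To see why the extra layer matters, here is a self-contained sketch (plain
Scala, no Spark required) of the two chain shapes; `Wrapped` stands in for
`SparkException` and `Typed` for `SchemaColumnConvertNotSupportedException`:
```scala
// Stand-ins for SparkException and SchemaColumnConvertNotSupportedException.
class Typed extends RuntimeException("typed")
class Wrapped(cause: Throwable) extends RuntimeException("wrapped", cause)

// Spark's own reader (and Comet on 4.x): one wrapping layer.
val oneLevel = new Wrapped(new Typed)
// Comet's 3.x shim today: two wrapping layers.
val twoLevel = new Wrapped(new Wrapped(new Typed))

assert(oneLevel.getCause.isInstanceOf[Typed])   // passes
assert(twoLevel.getCause.isInstanceOf[Wrapped]) // getCause is the inner Wrapped
```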
## Why it matters
Spark's own test `SPARK-34212 Parquet should read decimals correctly` (and
similar tests) asserts the cause directly:
```scala
val e = intercept[SparkException] {
  readParquet(schema, path).collect()
}.getCause
assert(e.isInstanceOf[SchemaColumnConvertNotSupportedException])
```
On Comet 3.x, `e.getCause` is the inner `SparkException`, not the
`SchemaColumnConvertNotSupportedException`, so the assertion fails. Tests that
walk the cause chain (e.g. our regression test in `ParquetReadSuite`) pass;
a sketch follows.
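A minimal sketch of such a chain-walking assertion, reusing ScalaTest's
`intercept` and the `readParquet` helper from the snippet above; it passes
whether the typed exception sits one or two `SparkException` layers down:
```scala
// Walk the full cause chain instead of assuming a fixed wrapping depth.
val top = intercept[SparkException] { readParquet(schema, path).collect() }
val chain = Iterator.iterate[Throwable](top)(_.getCause).takeWhile(_ != null)
assert(chain.exists(_.isInstanceOf[SchemaColumnConvertNotSupportedException]))
```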
## Affected tests (currently kept ignored)
- `dev/diffs/3.4.3.diff` — `SPARK-34212 Parquet should read decimals correctly`
  (`ParquetQuerySuite`).
- `dev/diffs/3.5.8.diff` — the same test.

These would be unignored in `4.0.2.diff` / `4.1.1.diff` (where the chain is
one-level and the schema-adapter rejection from #4351 is in place).
## Suggested fix
Change the 3.x shim to throw `SchemaColumnConvertNotSupportedException`
directly rather than wrapping it in `unsupportedSchemaColumnConvertError`'s
`SparkException`. Spark's task error handling will wrap it once on the way
back to the driver, producing the same one-level chain as Spark's own
vectorized reader. The error message format (`Parquet column cannot be
converted in file …`) needs to be preserved, since some Spark SQL tests
assert on it.
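Concretely, the change amounts to something like the following sketch (against
the shim snippet quoted above; whether the shim throws or returns the
exception, and how the message format is kept, are details for the actual
patch):
```scala
// Before (current 3.x shim): the typed exception is wrapped in the
// "_LEGACY_ERROR_TEMP_2063" SparkException, which the executor wraps again.
val cause = new SchemaColumnConvertNotSupportedException(
  column, physicalType, logicalType)
QueryExecutionErrors.unsupportedSchemaColumnConvertError(
  filePath, column, logicalType, physicalType, cause)

// After (suggested): surface the typed exception itself; Spark's task error
// handling adds the single SparkException layer on the way to the driver.
// The "Parquet column cannot be converted in file ..." message must still be
// preserved somehow, since some Spark SQL tests assert on it.
new SchemaColumnConvertNotSupportedException(column, physicalType, logicalType)
```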
## Related
- #4351 — the schema-adapter rejection that surfaces this chain.
- #3720 — parent umbrella.