schenksj opened a new pull request, #4536:
URL: https://github.com/apache/datafusion-comet/pull/4536
## Which issue does this PR close?
Closes #4529.
(Supersedes #4534, which was closed; this PR carries the
typed-classification design described below.)
## Rationale for this change
When Comet's native DataFusion scan hits a corrupt footer, corrupt
page/column data, a truncated/empty file, or a deleted file, it rethrew the raw
native message instead of Spark's `FAILED_READ_FILE`, and the offending path
was usually missing.
This is a standalone error-compatibility improvement for all native Parquet
scans. It was surfaced while working on the Delta Lake contrib integration
(Delta's corrupt-file / broken-checkpoint suites assert the `FAILED_READ_FILE`
message and path).
## What changes are included in this PR?
File-read failures are classified by **typed `DataFusionError` variant** in
the native error path — not by matching error-message prose. The messages come
from three upstream crates (DataFusion, arrow-rs, object_store) and drift
across version bumps with no compile-time signal; the typed match arms are
checked by the compiler.
- **native**: new `SparkError::CannotReadFile { file_path, message }`;
`try_classify_file_read_error` in the JNI bridge maps a file-read
`DataFusionError` (`ParquetError`, `ObjectStore`, `ArrowError` wrapping a
`ParquetError`, `IoError`; unwrapping `Context`/`Shared`) into it. `file_path`
is taken from `object_store::Error::NotFound` when available. It deliberately
does **not** match `object_store::Error::Generic` /
`DataFusionError::Execution`, which also carry non-file config errors (e.g.
"Hdfs support is not enabled in this build") that must surface as-is.
- **JVM**: the structured error crosses JNI as the existing
`CometQueryExecutionException` JSON payload; `SparkErrorConverter` decodes
`"CannotReadFile"` and, when the native error carried no path, fills it from
the per-task file list threaded from `CometNativeScanExec` via `CometExecRDD`.
The spark-3.5 and spark-4.x shims wrap it via
`QueryExecutionErrors.cannotReadFilesError`. No JVM-side message matching.
## How are these changes tested?
Coverage across all three layers of the new design:
- **native classifier** (`errors.rs` unit tests, 6): each `DataFusionError`
variant — `ParquetError`, `ArrowError(ParquetError)`, `ObjectStore::NotFound`
(asserts the path is carried), `IoError`, `Context`/`Shared` unwrapping —
classifies as a file-read error; `Execution`/`Internal` (non-file) do **not**.
- **JVM decode** (`SparkErrorConverterSuite`, 3): `CannotReadFile` →
`FAILED_READ_FILE` naming the file; empty native path falls back to the
per-task file list; a native path is preferred over the fallback.
- **end-to-end** (`CometExecSuite`, 1): writes a valid parquet file,
corrupts column/page data while leaving the footer intact (so Spark's JVM
footer pre-check passes and the failure surfaces from the native reader during
execution), reads it through Comet, and asserts the cause chain contains a
`FAILED_READ_FILE` exception naming the file. Fails on `main` (raw
`CometNativeException`), passes here.
`CometExecSuite` (127/0) and `SparkErrorConverterSuite` (11/11) pass.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]