[PR] feat: surface native parquet read failures as FAILED_READ_FILE [datafusion-comet]

via GitHub Sat, 30 May 2026 06:37:22 -0700


schenksj opened a new pull request, #4536:
URL: https://github.com/apache/datafusion-comet/pull/4536


   ## Which issue does this PR close?
   
   Closes #4529.
   
   (Supersedes #4534, which was closed; this PR carries the 
typed-classification design described below.)
   
   ## Rationale for this change
   
   When Comet's native DataFusion scan hits a corrupt footer, corrupt 
page/column data, a truncated/empty file, or a deleted file, it rethrew the raw 
native message instead of Spark's `FAILED_READ_FILE`, and the offending path 
was usually missing.
   
   This is a standalone error-compatibility improvement for all native Parquet 
scans. It was surfaced while working on the Delta Lake contrib integration 
(Delta's corrupt-file / broken-checkpoint suites assert the `FAILED_READ_FILE` 
message and path).
   
   ## What changes are included in this PR?
   
   File-read failures are classified by **typed `DataFusionError` variant** in 
the native error path — not by matching error-message prose. The messages come 
from three upstream crates (DataFusion, arrow-rs, object_store) and drift 
across version bumps with no compile-time signal; the typed match arms are 
checked by the compiler.
   
   - **native**: new `SparkError::CannotReadFile { file_path, message }`; 
`try_classify_file_read_error` in the JNI bridge maps a file-read 
`DataFusionError` (`ParquetError`, `ObjectStore`, `ArrowError` wrapping a 
`ParquetError`, `IoError`; unwrapping `Context`/`Shared`) into it. `file_path` 
is taken from `object_store::Error::NotFound` when available. It deliberately 
does **not** match `object_store::Error::Generic` / 
`DataFusionError::Execution`, which also carry non-file config errors (e.g. 
"Hdfs support is not enabled in this build") that must surface as-is.
   - **JVM**: the structured error crosses JNI as the existing 
`CometQueryExecutionException` JSON payload; `SparkErrorConverter` decodes 
`"CannotReadFile"` and, when the native error carried no path, fills it from 
the per-task file list threaded from `CometNativeScanExec` via `CometExecRDD`. 
The spark-3.5 and spark-4.x shims wrap it via 
`QueryExecutionErrors.cannotReadFilesError`. No JVM-side message matching.
   
   ## How are these changes tested?
   
   Coverage across all three layers of the new design:
   
   - **native classifier** (`errors.rs` unit tests, 6): each `DataFusionError` 
variant — `ParquetError`, `ArrowError(ParquetError)`, `ObjectStore::NotFound` 
(asserts the path is carried), `IoError`, `Context`/`Shared` unwrapping — 
classifies as a file-read error; `Execution`/`Internal` (non-file) do **not**.
   - **JVM decode** (`SparkErrorConverterSuite`, 3): `CannotReadFile` → 
`FAILED_READ_FILE` naming the file; empty native path falls back to the 
per-task file list; a native path is preferred over the fallback.
   - **end-to-end** (`CometExecSuite`, 1): writes a valid parquet file, 
corrupts column/page data while leaving the footer intact (so Spark's JVM 
footer pre-check passes and the failure surfaces from the native reader during 
execution), reads it through Comet, and asserts the cause chain contains a 
`FAILED_READ_FILE` exception naming the file. Fails on `main` (raw 
`CometNativeException`), passes here.
   
   `CometExecSuite` (127/0) and `SparkErrorConverterSuite` (11/11) pass.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat: surface native parquet read failures as FAILED_READ_FILE [datafusion-comet]

Reply via email to