schenksj opened a new pull request, #4534:
URL: https://github.com/apache/datafusion-comet/pull/4534

   ## Which issue does this PR close?
   
   Closes #4529.
   
   ## Rationale for this change
   
   When Comet's native DataFusion scan hits a corrupt footer, corrupt 
page/column data, a truncated/empty file, or a deleted file, it rethrew the raw 
native message instead of Spark's `FAILED_READ_FILE.NO_HINT`. The native path 
does not go through Spark's `FileScanRDD`, so `InputFileBlockHolder` is usually 
unpopulated and the offending path was missing.
   
   The prior handling matched only a `^Parquet error:` regex and rewrote to 
`_LEGACY_ERROR_TEMP_2254` / "File is not a Parquet file.", which both missed 
most file-read failures and didn't match what Spark's own reader produces.
   
   This is a standalone error-compatibility improvement for all native Parquet 
scans. It was surfaced while working on the Delta Lake contrib integration 
(Delta's corrupt-file / broken-checkpoint suites assert the `FAILED_READ_FILE` 
message and path), so prioritizing it helps that effort.
   
   ## What changes are included in this PR?
   
   - `CometExecIterator.isFileReadError` classifies file-read failures by 
matching the specific native phrasings: DataFusion `Parquet error:`, arrow-rs 
`Parquet argument error` / `failed to fill whole buffer` (corrupt page/column 
data hit during execution), and object_store `Requested range was invalid` 
(truncated/empty) and `Object at location ... not found` (deleted). It 
deliberately does **not** match the broad `Generic <Store> error:` prefix, 
which also covers non-file config errors (e.g. `Generic HadoopFileSystem error: 
Hdfs support is not enabled in this build`) that must surface as-is.
   - `ShimSparkErrorConverter.wrapNativeParquetError` (in both the spark-3.5 
and spark-4.x shims) wraps the cause via 
`QueryExecutionErrors.cannotReadFilesError(cause, path)`.
   - Per-partition file paths are threaded from `CometNativeScanExec` → 
`CometExecRDD` → `CometExecIterator` so the wrapped error names the actual 
file, with an `InputFileBlockHolder` fallback for any path that does populate 
it.
   
   Scope note: this wraps the direct native-scan execution path 
(`CometNativeScanExec.doExecuteColumnar`); surfacing the path when a scan is 
fused into a larger native block is a follow-up.
   
   ## How are these changes tested?
   
   New test in `CometExecSuite` writes a valid parquet file, corrupts 
column/page data in the middle while leaving the footer intact (so Spark's JVM 
footer pre-check passes and the failure surfaces from the native reader during 
execution), reads it through Comet, and asserts the cause chain contains a 
`FAILED_READ_FILE` exception naming the file. The test fails on `main` (raw 
`CometNativeException`) and passes with this change. `CometExecSuite` passes 
(127/0).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to