schenksj opened a new pull request, #4534: URL: https://github.com/apache/datafusion-comet/pull/4534
## Which issue does this PR close? Closes #4529. ## Rationale for this change When Comet's native DataFusion scan hits a corrupt footer, corrupt page/column data, a truncated/empty file, or a deleted file, it rethrew the raw native message instead of Spark's `FAILED_READ_FILE.NO_HINT`. The native path does not go through Spark's `FileScanRDD`, so `InputFileBlockHolder` is usually unpopulated and the offending path was missing. The prior handling matched only a `^Parquet error:` regex and rewrote to `_LEGACY_ERROR_TEMP_2254` / "File is not a Parquet file.", which both missed most file-read failures and didn't match what Spark's own reader produces. This is a standalone error-compatibility improvement for all native Parquet scans. It was surfaced while working on the Delta Lake contrib integration (Delta's corrupt-file / broken-checkpoint suites assert the `FAILED_READ_FILE` message and path), so prioritizing it helps that effort. ## What changes are included in this PR? - `CometExecIterator.isFileReadError` classifies file-read failures by matching the specific native phrasings: DataFusion `Parquet error:`, arrow-rs `Parquet argument error` / `failed to fill whole buffer` (corrupt page/column data hit during execution), and object_store `Requested range was invalid` (truncated/empty) and `Object at location ... not found` (deleted). It deliberately does **not** match the broad `Generic <Store> error:` prefix, which also covers non-file config errors (e.g. `Generic HadoopFileSystem error: Hdfs support is not enabled in this build`) that must surface as-is. - `ShimSparkErrorConverter.wrapNativeParquetError` (in both the spark-3.5 and spark-4.x shims) wraps the cause via `QueryExecutionErrors.cannotReadFilesError(cause, path)`. - Per-partition file paths are threaded from `CometNativeScanExec` → `CometExecRDD` → `CometExecIterator` so the wrapped error names the actual file, with an `InputFileBlockHolder` fallback for any path that does populate it. Scope note: this wraps the direct native-scan execution path (`CometNativeScanExec.doExecuteColumnar`); surfacing the path when a scan is fused into a larger native block is a follow-up. ## How are these changes tested? New test in `CometExecSuite` writes a valid parquet file, corrupts column/page data in the middle while leaving the footer intact (so Spark's JVM footer pre-check passes and the failure surfaces from the native reader during execution), reads it through Comet, and asserts the cause chain contains a `FAILED_READ_FILE` exception naming the file. The test fails on `main` (raw `CometNativeException`) and passes with this change. `CometExecSuite` passes (127/0). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
