schenksj opened a new issue, #4529:
URL: https://github.com/apache/datafusion-comet/issues/4529

   ### Problem
   
   When a native Parquet scan hits a corrupt footer, a truncated/empty file, or 
a deleted file, Comet rethrows the raw DataFusion / object_store message:
   
   - `Parquet error: ...` (corrupt footer etc.)
   - `Requested range was invalid` (0-byte / truncated file)
   - `Object at location ... not found` (deleted file)
   
   Spark's own reader surfaces these as `FAILED_READ_FILE.NO_HINT` carrying the 
offending file path, and tests/tools assert on that shape. Comet's native path 
does **not** go through Spark's `FileScanRDD`, so `InputFileBlockHolder` is 
usually unpopulated and the path is missing from any wrapped error.
   
   ### Proposed fix
   
   - `CometExecIterator.isFileReadError` classifies file-read failures by 
matching those specific IO phrasings -- deliberately **not** the broad `Generic 
<Store> error:` prefix, which also covers non-file config errors (e.g. `Generic 
HadoopFileSystem error: Hdfs support is not enabled in this build`) that must 
surface as-is.
   - `ShimSparkErrorConverter.wrapNativeParquetError` (in both the spark-3.5 
and spark-4.x shims) wraps the cause via 
`QueryExecutionErrors.cannotReadFilesError(cause, path)`.
   - Thread per-partition file paths from `CometNativeScanExec` -> 
`CometNativeExec` / `CometExecRDD` -> `CometExecIterator` so the wrapped error 
names the actual file, falling back to `InputFileBlockHolder` on any code path 
that does populate it.
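
The classification rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not Comet's actual implementation: the class name `FileReadErrorClassifier` and the exact substring/regex checks are assumptions made for the example, and only the three message phrasings listed under "Problem" are taken from this issue.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Comet.
final class FileReadErrorClassifier {

    // Matches the "Object at location ... not found" phrasing (deleted file).
    private static final Pattern OBJECT_NOT_FOUND =
        Pattern.compile("Object at location .* not found");

    private FileReadErrorClassifier() {}

    // Returns true only for the specific IO phrasings named in the issue.
    // The broad "Generic <Store> error:" prefix is deliberately not checked,
    // so config errors such as the "Hdfs support is not enabled" message
    // fall through and surface unchanged.
    static boolean isFileReadError(String message) {
        if (message == null) {
            return false;
        }
        return message.contains("Parquet error:")
            || message.contains("Requested range was invalid")
            || OBJECT_NOT_FOUND.matcher(message).find();
    }
}
```

With this shape, a message is wrapped only when it matches one of the known file-level phrasings; anything else, including `Generic HadoopFileSystem error: Hdfs support is not enabled in this build`, is left as-is.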
   
   ### Relationship to the Delta integration
   
   This is a standalone error-compatibility improvement for all native Parquet 
scans. It is also **required for** the in-progress Delta Lake contrib 
integration (Delta's corrupt-file / broken-checkpoint suites assert the 
`FAILED_READ_FILE` message and path), so it would help to prioritize it 
accordingly. A PR will follow shortly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

