andygrove commented on PR #4536:
URL: 
https://github.com/apache/datafusion-comet/pull/4536#issuecomment-4586952074

   Nice work on this. Classifying file-read failures by typed `DataFusionError` 
variant rather than matching message prose is the right call, since those 
messages come from three upstream crates and drift silently across version 
bumps. Deliberately leaving `Execution` and `object_store::Generic` unmatched 
so config errors still surface as-is is a thoughtful touch too.
   
   The CI failures sort into three root causes, and the encouraging part is 
that the feature itself works. The Spark 4.0 PR Build passes the end-to-end 
test, so the wrapping is sound. The rest is reachability and message-matching.
   
   **1. The spark-3.4 shim was not updated.** 
`spark-3.4/.../ShimSparkErrorConverter.scala` is a separate full 
implementation, and it does not have the `CannotReadFile` case, only 3.5 and 
4.x got it. So on Spark 3.4 the error is never wrapped and the new 
`CometExecSuite` test fails with 
`messages.exists(_.contains("FAILED_READ_FILE")) was false`. Could the case be 
added here too, adjusting for any 3.4 differences in `cannotReadFilesError`?
   
   **2. The tests assert on a version-dependent message substring.** On 3.5 the 
error actually is wrapped correctly. The stack trace shows 
`cannotReadFilesError` being reached through the new shim case. The test still 
fails because it asserts `getMessage.contains("FAILED_READ_FILE")`, and on 3.5 
the rendered message is `Encountered error while reading file ... Details:` 
with no `FAILED_READ_FILE` literal. Spark 4.0 prepends `[FAILED_READ_FILE]` to 
`getMessage`, which is why only 3.4 and 3.5 fail. This hits two 
`SparkErrorConverterSuite` tests and the `CometExecSuite` test. Could these 
assert on the error class instead, via `SparkThrowable.getErrorClass` on 3.x 
and `getCondition` on 4.0 behind the shim? The third decode test passes 
precisely because it checks the path rather than the literal, which is a good 
hint about the robust assertion style.
   
   **3. The corrupt/missing-file message needs to match what Spark's tests 
expect.** The Spark SQL Test jobs fail on `ParquetQuerySuite` 
"Enabling/disabling ignoreCorruptFiles" and "ignoreMissingFiles using parquet" 
because Spark asserts the cause contains `is not a Parquet file`, while Comet's 
native reader reports `Invalid Parquet file. Corrupt footer`. For these to 
pass, the surfaced message has to carry Spark's wording rather than 
DataFusion's. It would be worth deciding where to normalize that. The 
classifier already has the typed variant in hand, so it could map a 
corrupt-footer `ParquetError` to a message Spark's tests expect, and similarly 
for the truncated/too-short file case that drives the missing-files test.
   
   There is also a behavior question hiding in those test names. They exercise 
`spark.sql.files.ignoreCorruptFiles` and `ignoreMissingFiles`, which tell Spark 
to skip the bad file rather than fail the query. Does Comet's native scan honor 
those configs, or does it now always throw `FAILED_READ_FILE`? If it does not 
honor them, enabling the config would no longer skip corrupt or missing files 
under Comet, which would be a behavior gap on top of the wording mismatch.
   
   **Smaller thing on the native side.** In `try_classify_file_read_error`, 
`DataFusionError::IoError(_)` is classified as a file read unconditionally. 
Parquet and ObjectStore are clearly file reads, but `IoError` can also come 
from non-scan paths like spill or shuffle. Is there a risk this mislabels a 
non-scan IO error as `FAILED_READ_FILE` and attaches a per-task path to it? 
Scoping that arm a bit more tightly might be safer. The guard arm also calls 
`try_classify_file_read_error(source)` twice (`.is_some()` then `.unwrap()`), 
which an `if let Some(...)` would tidy up.
   
   *Disclosure: I used Claude Code to help review this PR, including pulling 
the failing CI logs and tracing each failure to its root cause.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to