andygrove commented on PR #4536:
URL:
https://github.com/apache/datafusion-comet/pull/4536#issuecomment-4586952074
Nice work on this. Classifying file-read failures by typed `DataFusionError`
variant rather than matching message prose is the right call, since those
messages come from three upstream crates and drift silently across version
bumps. Deliberately leaving `Execution` and `object_store::Generic` unmatched
so config errors still surface as-is is a thoughtful touch too.
The CI failures sort into three root causes, and the encouraging part is
that the feature itself works. The Spark 4.0 PR Build passes the end-to-end
test, so the wrapping is sound. The rest is reachability and message-matching.
**1. The spark-3.4 shim was not updated.**
`spark-3.4/.../ShimSparkErrorConverter.scala` is a separate full
implementation, and it does not have the `CannotReadFile` case, only 3.5 and
4.x got it. So on Spark 3.4 the error is never wrapped and the new
`CometExecSuite` test fails with
`messages.exists(_.contains("FAILED_READ_FILE")) was false`. Could the case be
added here too, adjusting for any 3.4 differences in `cannotReadFilesError`?
**2. The tests assert on a version-dependent message substring.** On 3.5 the
error actually is wrapped correctly. The stack trace shows
`cannotReadFilesError` being reached through the new shim case. The test still
fails because it asserts `getMessage.contains("FAILED_READ_FILE")`, and on 3.5
the rendered message is `Encountered error while reading file ... Details:`
with no `FAILED_READ_FILE` literal. Spark 4.0 prepends `[FAILED_READ_FILE]` to
`getMessage`, which is why only 3.4 and 3.5 fail. This hits two
`SparkErrorConverterSuite` tests and the `CometExecSuite` test. Could these
assert on the error class instead, via `SparkThrowable.getErrorClass` on 3.x
and `getCondition` on 4.0 behind the shim? The third decode test passes
precisely because it checks the path rather than the literal, which is a good
hint about the robust assertion style.
**3. The corrupt/missing-file message needs to match what Spark's tests
expect.** The Spark SQL Test jobs fail on `ParquetQuerySuite`
"Enabling/disabling ignoreCorruptFiles" and "ignoreMissingFiles using parquet"
because Spark asserts the cause contains `is not a Parquet file`, while Comet's
native reader reports `Invalid Parquet file. Corrupt footer`. For these to
pass, the surfaced message has to carry Spark's wording rather than
DataFusion's. It would be worth deciding where to normalize that. The
classifier already has the typed variant in hand, so it could map a
corrupt-footer `ParquetError` to a message Spark's tests expect, and similarly
for the truncated/too-short file case that drives the missing-files test.
There is also a behavior question hiding in those test names. They exercise
`spark.sql.files.ignoreCorruptFiles` and `ignoreMissingFiles`, which tell Spark
to skip the bad file rather than fail the query. Does Comet's native scan honor
those configs, or does it now always throw `FAILED_READ_FILE`? If it does not
honor them, enabling the config would no longer skip corrupt or missing files
under Comet, which would be a behavior gap on top of the wording mismatch.
**Smaller thing on the native side.** In `try_classify_file_read_error`,
`DataFusionError::IoError(_)` is classified as a file read unconditionally.
Parquet and ObjectStore are clearly file reads, but `IoError` can also come
from non-scan paths like spill or shuffle. Is there a risk this mislabels a
non-scan IO error as `FAILED_READ_FILE` and attaches a per-task path to it?
Scoping that arm a bit more tightly might be safer. The guard arm also calls
`try_classify_file_read_error(source)` twice (`.is_some()` then `.unwrap()`),
which an `if let Some(...)` would tidy up.
*Disclosure: I used Claude Code to help review this PR, including pulling
the failing CI logs and tracing each failure to its root cause.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]