JoshRosen commented on code in PR #36775:
URL: https://github.com/apache/spark/pull/36775#discussion_r890727456
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -253,6 +253,9 @@ class FileScanRDD(
// Throw FileNotFoundException even if `ignoreCorruptFiles`
is true
case e: FileNotFoundException if !ignoreMissingFiles =>
throw e
case e @ (_: RuntimeException | _: IOException) if
ignoreCorruptFiles =>
+ if (e.getMessage.contains("Filesystem closed")) {
Review Comment:
I agree that checking exception messages isn't the most robust solution.
Thinking about the risks of exception text matching, I see two potential
classes of problems:
1. _Genuine_ corruption might just-so-happen to produce an exception that
also contains "Filesystem closed", in which case we'd break a user's job.
2. The third-party FileSystem implementation could change its error
messages, which would cause this PR's fix to stop working.
Given the fact that the current code treats essentially _all_ exceptions as
corruption, a potentially-brittle solution which ignores issue (2) and solves
issue (1) _might_ not be a terrible trade-off.
For example, let's say that we're only trying to put a bandaid on the
`DFSClient` closed filesystem issue and aren't trying to solve the broader
class of issues I mentioned in my other comment. In that case, we might be able
to write some sort of helper function which would walk through the exception's
stack frames and look for a frame that calls `DFSClient.checkOpen`. Given that
Hadoop hasn't changed that method in 11+ years, it seems pretty unlikely that
they would break it in a future release.
Then, we could do something like
```
case e: IOException if isDFSClientClosedException(e) => throw e
```
before the current corruption ignoring case.
This would have no chance of mis-identifying corruption as non-corruption.
It's potentially brittle and doesn't necessarily solve all cases, but it could
greatly reduce the likelihood of data loss (even though it doesn't completely
eliminate it). Depending on the context of use, maybe that's an okay short-term
band-aid?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]