JoshRosen commented on code in PR #36775:
URL: https://github.com/apache/spark/pull/36775#discussion_r890727456


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala:
##########
@@ -253,6 +253,9 @@ class FileScanRDD(
                   // Throw FileNotFoundException even if `ignoreCorruptFiles` 
is true
                   case e: FileNotFoundException if !ignoreMissingFiles => 
throw e
                   case e @ (_: RuntimeException | _: IOException) if 
ignoreCorruptFiles =>
+                    if (e.getMessage.contains("Filesystem closed")) {

Review Comment:
   I agree that checking exception messages isn't the most robust solution.
   
   Thinking about the risks of exception text matching, I see two potential 
classes of problems:
   
   1. _Genuine_ corruption might just-so-happen to produce an exception that 
also contains "Filesystem closed", in which case we'd break a user's job.
   2. The third-party FileSystem implementation could change its error 
messages, which would cause this PR's fix to stop working.
   
   Given the fact that the current code treats essentially _all_ exceptions as 
corruption, a potentially-brittle solution which ignores issue (2) and solves 
issue (1) _might_ not be a terrible trade-off. 
   
   For example, let's say that we're only trying to put a bandaid on the 
`DFSClient` closed filesystem issue and aren't trying to solve the broader 
class of issues I mentioned in my other comment. In that case, we might be able 
to write some sort of helper function which would walk through the exception's 
stack frames and look for a frame that calls `DFSClient.checkOpen`. Given that 
Hadoop hasn't changed that method in 11+ years, it seems pretty unlikely that 
they would break it in a future release.
   
   Then, we could do something like
   
   ```
   case e: IOException if isDFSClientClosedException(e) => throw e
   ```
   
   before the current corruption ignoring case.
   
   This would have no chance of mis-identifying corruption as non-corruption. 
It's potentially brittle and doesn't necessarily solve all cases, but it could 
greatly reduce the likelihood of data loss (even though it doesn't completely 
eliminate it). Depending on the context of use, maybe that's an okay short-term 
band-aid?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to