JoshRosen opened a new pull request #24668: [SPARK-27676][SQL] 
InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles
URL: https://github.com/apache/spark/pull/24668
 
 
   ## What changes were proposed in this pull request?
   
   Spark's `InMemoryFileIndex` contains two places where 
`FileNotFoundException`s are caught and logged as warnings (during [directory 
listing](https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274)
 and [block location 
lookup](https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333)).
 This logic was added in #15153 and #21408.
   
   I think that this is a dangerous default behavior because it can mask bugs 
caused by race conditions (e.g. overwriting a table while it's being read) or 
S3 consistency issues (there's more discussion on this in the [JIRA 
ticket](https://issues.apache.org/jira/browse/SPARK-27676)). Failing fast when 
we detect missing files is not sufficient to make concurrent table reads/writes 
or S3 listing safe (there are other classes of eventual consistency issues to 
worry about). Still, I think it's beneficial to throw exceptions and 
fail fast on the subset of inconsistencies / races that we _can_ detect, because 
that increases the likelihood that an end user will notice the problem and 
investigate further.
   
   There may be some cases where users _do_ want to ignore missing files, but I 
think that should be an opt-in behavior via the existing 
`spark.sql.files.ignoreMissingFiles` flag. The current behavior is itself 
race-prone: a file might be deleted between catalog listing and query 
execution time, triggering `FileNotFoundException`s on executors, which are 
handled in a way that _does_ respect `ignoreMissingFiles`.
   
   This PR updates `InMemoryFileIndex` to guard the 
log-and-ignore-FileNotFoundException behind the existing 
`spark.sql.files.ignoreMissingFiles` flag.
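   
   As a rough illustration of the guard described above (not the actual Scala 
change in this PR), here is a minimal, self-contained Python sketch. The 
function name `list_leaf_files`, the `ignore_missing_files` parameter, and the 
log message are all hypothetical stand-ins; the real logic lives in 
`InMemoryFileIndex` and reads the `spark.sql.files.ignoreMissingFiles` flag.

```python
import logging
import os

logger = logging.getLogger("file_index_sketch")


def list_leaf_files(path, ignore_missing_files):
    """List the entries under `path` (hypothetical helper, not a Spark API).

    A directory that vanished before listing is tolerated only when the
    caller has explicitly opted in via `ignore_missing_files`.
    """
    try:
        return sorted(os.path.join(path, name) for name in os.listdir(path))
    except FileNotFoundError:
        if ignore_missing_files:
            # Opt-in path: log a warning and treat the directory as empty.
            logger.warning("Directory %s was not found; ignoring it.", path)
            return []
        # Default path: fail fast so the race or inconsistency is noticed.
        raise
```

   With `ignore_missing_files=False` the exception propagates to the caller, 
which is the fail-fast default proposed here; passing `True` restores the 
previous log-and-ignore behavior.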
   
   **Note**: this is a change of default behavior, so I think it needs to be 
mentioned in release notes.
   
   ## How was this patch tested?
   
   Updated the existing test case to run with the `ignoreMissingFiles` flag set 
to both true and false.

