JoshRosen commented on a change in pull request #24668: [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles URL: https://github.com/apache/spark/pull/24668#discussion_r295095364
##########
File path: docs/sql-migration-guide-upgrade.md
##########
@@ -130,6 +130,8 @@ license: |
   - Since Spark 3.0, Spark will cast `String` to `Date/TimeStamp` in binary comparisons with dates/timestamps. The previous behaviour of casting `Date/Timestamp` to `String` can be restored by setting `spark.sql.legacy.typeCoercion.datetimeToString` to `true`.
 
+  - Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during `REFRESH TABLE`), not during query execution: the net change is that `spark.sql.files.ignoreMissingFiles` is now obeyed during table file listing / query planning, not only at query execution time.

Review comment:
Out of curiosity, can you give an example of a user workload which depends on ignoring files that go missing between the initial listing and a recursive listing / stat? I understand the case of wanting to ignore completely-missing table roots, but I can't come up with an intuitive example of when someone would want to purposely ignore the "list inconsistency" case.

I think that only ignoring deletions at the root and _not_ ignoring deletions at lower levels of the listing represents a fair compromise: we may fail to reliably detect _all_ race conditions, but the additional detection still provides some incremental value, and I feel like it's unlikely to result in "false positives" which would break real workloads.

@marmbrus @rxin @hvanhovell @brkyvz, do any of you have opinions on this PR's changes?
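To make the "list inconsistency" case concrete, here is a toy model of the race on a local filesystem, using only the Python standard library. It is a sketch under stated assumptions, not Spark's `InMemoryFileIndex` code: `list_leaf_files`, `demo`, and the `after_list` hook are hypothetical names introduced for illustration, and the `ignore_missing_files` flag only mimics the behavior the migration note describes for `spark.sql.files.ignoreMissingFiles`.

```python
import os
import tempfile


def list_leaf_files(path, ignore_missing_files=False, after_list=None):
    """Recursively list files under `path` (toy model, not Spark code).

    `after_list` is a hypothetical hook called right after each directory
    is listed; the demo uses it to simulate a concurrent deletion.
    """
    try:
        names = sorted(os.listdir(path))
    except FileNotFoundError:
        # The directory itself vanished before it could be listed.
        if ignore_missing_files:
            return []
        raise
    if after_list is not None:
        after_list(path)
    files = []
    for name in names:
        child = os.path.join(path, name)
        try:
            if os.path.isdir(child):
                files.extend(
                    list_leaf_files(child, ignore_missing_files, after_list))
            else:
                os.stat(child)  # raises if the entry disappeared after listing
                files.append(child)
        except FileNotFoundError:
            # The entry was in the listing but is gone now.
            if not ignore_missing_files:
                raise
    return files


def demo(ignore_missing_files):
    """List a small tree while one file is deleted mid-listing."""
    with tempfile.TemporaryDirectory() as root:
        part = os.path.join(root, "part=1")
        os.mkdir(part)
        keep = os.path.join(part, "a.parquet")
        gone = os.path.join(part, "b.parquet")
        for p in (keep, gone):
            open(p, "w").close()

        def delete_after_listing(listed_dir):
            # Simulate a concurrent delete between the listing and the stat.
            if listed_dir == part and os.path.exists(gone):
                os.remove(gone)

        found = list_leaf_files(root, ignore_missing_files, delete_after_listing)
        return [os.path.basename(p) for p in found]
```

In this model, `demo(True)` silently skips the vanished file and returns only the surviving one, while `demo(False)` raises `FileNotFoundError`, which corresponds to the fail-fast listing behavior described in the migration note above.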
The user-facing migration guide documentation is a bit convoluted here because it's trying to describe what I think is a pretty narrow set of circumstances where this change might break workloads. If we're not comfortable tying this to the `ignoreMissingFiles` configuration, maybe we could add a new configuration specifically for this behavior?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

With regards,
Apache Git Services
