[GitHub] [spark] JoshRosen commented on a change in pull request #24668: [SPARK-27676][SQL][SS] InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles

GitBox Tue, 18 Jun 2019 19:08:23 -0700

JoshRosen commented on a change in pull request #24668: [SPARK-27676][SQL][SS] 
InMemoryFileIndex should respect spark.sql.files.ignoreMissingFiles
URL: https://github.com/apache/spark/pull/24668#discussion_r295095364


 ##########
 File path: docs/sql-migration-guide-upgrade.md
 ##########
 @@ -130,6 +130,8 @@ license: |
 
   - Since Spark 3.0, Spark will cast `String` to `Date/TimeStamp` in binary 
comparisons with dates/timestamps. The previous behaviour of casting 
`Date/Timestamp` to `String` can be restored by setting 
`spark.sql.legacy.typeCoercion.datetimeToString` to `true`.
 
+  - Since Spark 3.0, if files or subdirectories disappear during recursive 
directory listing (i.e. they appear in an intermediate listing but then cannot 
be read or listed during later phases of the recursive directory listing, due 
to either concurrent file deletions or object store consistency issues) then 
the listing will fail with an exception unless 
`spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous 
versions, these missing files or subdirectories would be ignored. Note that 
this change of behavior only applies during initial table file listing (or 
during `REFRESH TABLE`), not during query execution: the net change is that 
`spark.sql.files.ignoreMissingFiles` is now obeyed during table file listing / 
query planning, not only at query execution time.
 
 Review comment:
   Out of curiosity, can you give an example of a user workload which is 
dependent on ignoring files that go missing between the initial listing and a 
recursive listing / stat? I understand the case of wanting to ignore 
completely-missing table roots but can't come up with an intuitive example of 
when someone would want to purposely ignore the "list inconsistency" case.
   
   I think that only ignoring deletions at the root and _not_ ignoring 
deletions at lower levels of the listing represents a fair compromise: we may 
fail to reliably detect _all_ race conditions but the additional detection 
still provides some incremental value and I feel like it's unlikely to result 
in "false positives" which would break real workloads.
   
   @marmbrus @rxin @hvanhovell @brkyvz, do any of you have opinions on this 
PR's changes? The user-facing migration guide documentation is a bit convoluted 
here because it's trying to describe what I think is a pretty narrow set of 
circumstances where this change might break workloads.
   
   If we're not comfortable tying this to the `ignoreMissingFiles` 
configuration, maybe we could add a new configuration specifically for this 
behavior?
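
   To make the compromise described above concrete, here is a minimal Python 
sketch. It is not Spark's code: `list_leaf_files`, `fake_list_dir`, 
`fake_is_dir` and the in-memory `LISTING` are all hypothetical names invented 
for illustration, and the `ignore_missing_files` parameter only stands in for 
`spark.sql.files.ignoreMissingFiles`. A missing root is always tolerated, 
while a subdirectory that vanishes after appearing in its parent's listing 
fails the listing unless the flag is set.

```python
from typing import Callable, List


def list_leaf_files(
    root: str,
    list_dir: Callable[[str], List[str]],
    is_dir: Callable[[str], bool],
    ignore_missing_files: bool = False,
) -> List[str]:
    """Recursively collect files under `root` (hypothetical helper).

    A missing root yields an empty result. A path that disappears below
    the root raises FileNotFoundError unless `ignore_missing_files` is True.
    """

    def walk(path: str, is_root: bool) -> List[str]:
        try:
            children = list_dir(path)
        except FileNotFoundError:
            if is_root or ignore_missing_files:
                return []  # tolerated: treat the path as empty
            raise  # surfaced: a lower-level listing race
        files: List[str] = []
        for child in children:
            if is_dir(child):
                files.extend(walk(child, is_root=False))
            else:
                files.append(child)
        return files

    return walk(root, is_root=True)


# Hypothetical in-memory "file system" that simulates the race:
# "/t/a" shows up in the listing of "/t" but can no longer be listed.
LISTING = {"/t": ["/t/a", "/t/f1"]}
DIRECTORIES = {"/t", "/t/a"}


def fake_list_dir(path: str) -> List[str]:
    if path not in LISTING:
        raise FileNotFoundError(path)
    return LISTING[path]


def fake_is_dir(path: str) -> bool:
    return path in DIRECTORIES
```

   With this fake file system, listing a nonexistent root returns an empty 
list, listing `/t` raises `FileNotFoundError` by default, and listing `/t` 
with `ignore_missing_files=True` returns only `/t/f1`.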

