[
https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837552#comment-16837552
]
Michael Armbrust commented on SPARK-27676:
------------------------------------------
I tend to agree that all cases where we choose to ignore missing files should be
hidden behind the existing {{spark.sql.files.ignoreMissingFiles}} flag.
> InMemoryFileIndex should hard-fail on missing files instead of logging and
> continuing
> -------------------------------------------------------------------------------------
>
> Key: SPARK-27676
> URL: https://issues.apache.org/jira/browse/SPARK-27676
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Josh Rosen
> Priority: Major
>
> Spark's {{InMemoryFileIndex}} contains two places where
> {{FileNotFoundException}}s are caught and logged as warnings (during [directory
> listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
> and [block location
> lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]).
> I think that this is a dangerous default behavior and would prefer that
> Spark hard-fails by default (with the ignore-and-continue behavior guarded by
> a SQL session configuration).
> In SPARK-17599 and SPARK-24364, logic was added to ignore missing files.
> Quoting from the PR for SPARK-17599:
> {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths.
> If a folder is deleted at any time between the paths were resolved and the
> file catalog can check for the folder, the Spark job fails. This may abruptly
> stop long running StructuredStreaming jobs for example.
> Folders may be deleted by users or automatically by retention policies. These
> cases should not prevent jobs from successfully completing.
> {quote}
> Let's say that I'm *not* expecting to ever delete input files for my job. In
> that case, this behavior can mask bugs.
> One straightforward masked bug class is accidental file deletion: if I'm
> never expecting to delete files then I'd prefer to fail my job if Spark sees
> deleted files.
> A more subtle bug can occur when using an S3 filesystem. Say I'm running a
> Spark job against a partitioned Parquet dataset which is laid out like this:
> {code:java}
> data/
>   date=1/
>     region=west/
>       0.parquet
>       1.parquet
>     region=east/
>       0.parquet
>       1.parquet{code}
> If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform
> multiple rounds of file listing, first listing {{/data/date=1}} to discover
> the partitions for that date, then listing within each partition to discover
> the leaf files. Due to the eventual consistency of S3 ListObjects, it's
> possible that the first listing will show the {{region=west}} and
> {{region=east}} partitions existing and then the next-level listing fails to
> return any files for some of the directories (e.g. {{/data/date=1/}} returns files
> but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A
> due to ListObjects inconsistency).
> If Spark propagated the {{FileNotFoundException}} and hard-failed, then I'd
> be able to fail the job in this case where we _definitely_ know that the S3
> listing is inconsistent. Failing here doesn't guard against _all_ potential S3
> list inconsistency issues (e.g. back-to-back listings which both return a
> subset of the true set of objects), but I think it's still an improvement to
> fail for the subset of cases that we _can_ detect even if that's not a
> surefire failsafe against the more general problem.
> Finally, I'm unsure if the original patch will have the desired effect: if a
> file is deleted once a Spark job expects to read it then that can cause
> problems at multiple layers, both in the driver (multiple rounds of file
> listing) and in executors (if the deletion occurs after the construction of
> the catalog but before the scheduling of the read tasks); I think the
> original patch only resolved the problem for the driver (unless I'm missing
> similar executor-side code specific to the original streaming use-case).
> Given all of these reasons, I think that the "ignore potentially deleted
> files during file index listing" behavior should be guarded behind a feature
> flag which defaults to {{false}}, consistent with the existing
> {{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}}
> flags (which both default to false).
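The behavior proposed in the description can be illustrated with a minimal sketch. This is not Spark's {{InMemoryFileIndex}} code: the in-memory {{FAKE_FS}} mapping, the {{list_dir}} and {{list_leaf_files}} helpers, and the {{ignore_missing_files}} parameter are all hypothetical names chosen for illustration. The sketch only models two rounds of listing where a partition is visible in its parent's listing but cannot be listed itself, and shows the difference between failing by default and ignoring the missing directory behind a flag.

```python
import logging

logger = logging.getLogger("file_index_sketch")

# Hypothetical in-memory "filesystem": directory path -> child names.
# region=west shows up in its parent's listing but cannot be listed itself,
# imitating the inconsistent second-round listing from the description.
FAKE_FS = {
    "/data/date=1": ["region=west", "region=east"],
    "/data/date=1/region=east": ["0.parquet", "1.parquet"],
}


def list_dir(path):
    """List one directory, raising FileNotFoundError if it is not visible."""
    if path not in FAKE_FS:
        raise FileNotFoundError(path)
    return FAKE_FS[path]


def list_leaf_files(root, ignore_missing_files=False):
    """Two rounds of listing: partitions under root, then files per partition."""
    leaf_files = []
    for partition in list_dir(root):
        partition_path = f"{root}/{partition}"
        try:
            children = list_dir(partition_path)
        except FileNotFoundError:
            if not ignore_missing_files:
                # Proposed default: propagate the error and fail the job.
                raise
            # Opt-in behavior: log a warning and skip the directory.
            logger.warning("Ignoring missing directory: %s", partition_path)
            continue
        leaf_files.extend(f"{partition_path}/{name}" for name in children)
    return leaf_files
```

Under these assumptions, calling {{list_leaf_files("/data/date=1")}} raises {{FileNotFoundError}} for the {{region=west}} directory, while passing {{ignore_missing_files=True}} returns only the two {{region=east}} files, which is the silently partial result the description warns about.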