[ https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837552#comment-16837552 ]

Michael Armbrust commented on SPARK-27676:
------------------------------------------

I tend to agree that all cases where we chose to ignore missing files should be 
hidden behind the existing {{spark.sql.files.ignoreMissingFiles}} flag.
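
For illustration, here's a minimal sketch of toggling that existing flag on a
session (the flag name and its {{false}} default are real; the session setup is
illustrative):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignore-missing-files-sketch")
  .getOrCreate()

// Default is false: a file that vanishes between listing and reading fails the job.
// Setting it to true opts in to the ignore-and-continue behavior.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
{code}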

> InMemoryFileIndex should hard-fail on missing files instead of logging and 
> continuing
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-27676
>                 URL: https://issues.apache.org/jira/browse/SPARK-27676
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Josh Rosen
>            Priority: Major
>
> Spark's {{InMemoryFileIndex}} contains two places where a 
> {{FileNotFoundException}} is caught and logged as a warning (during [directory 
> listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
>  and [block location 
> lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]).
>  I think that this is a dangerous default behavior and would prefer that 
> Spark hard-fails by default (with the ignore-and-continue behavior guarded by 
> a SQL session configuration).
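>
> As a hedged sketch (not the actual {{InMemoryFileIndex}} code), the guarded 
> behavior I'm proposing would look roughly like this, with 
> {{ignoreMissingFiles}} read from the SQL session config:
> {code:scala}
> import java.io.FileNotFoundException
> import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
>
> def listLeafFiles(
>     fs: FileSystem,
>     path: Path,
>     ignoreMissingFiles: Boolean): Seq[FileStatus] = {
>   try {
>     fs.listStatus(path).toSeq
>   } catch {
>     // Opt-in: log and skip the vanished directory.
>     case _: FileNotFoundException if ignoreMissingFiles =>
>       println(s"The directory $path was not found. Was it deleted very recently?")
>       Seq.empty[FileStatus]
>     // With ignoreMissingFiles = false (the proposed default), the exception
>     // propagates and the job hard-fails.
>   }
> }
> {code}
>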
> In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
> Quoting from the PR for SPARK-17599:
> {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. 
> If a folder is deleted at any time between the paths were resolved and the 
> file catalog can check for the folder, the Spark job fails. This may abruptly 
> stop long running StructuredStreaming jobs for example.
> Folders may be deleted by users or automatically by retention policies. These 
> cases should not prevent jobs from successfully completing.
> {quote}
>
> Let's say that I'm *not* expecting to ever delete input files for my job. In 
> that case, this behavior can mask bugs.
>
> One straightforward masked bug class is accidental file deletion: if I never 
> expect to delete files, then I'd prefer that my job fail when Spark sees 
> deleted files.
>
> A more subtle bug can occur when using an S3 filesystem. Say I'm running a 
> Spark job against a partitioned Parquet dataset which is laid out like this:
> {code:java}
> data/
>   date=1/
>     region=west/
>        0.parquet
>        1.parquet
>     region=east/
>        0.parquet
>        1.parquet
> {code}
>
> If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform 
> multiple rounds of file listing: first listing {{/data/date=1}} to discover 
> the partitions for that date, then listing within each partition to discover 
> the leaf files. Due to the eventual consistency of S3 ListObjects, it's 
> possible for the first listing to show the {{region=west}} and 
> {{region=east}} partitions and for the next-level listing to then fail for 
> some of those directories (e.g. listing {{/data/date=1/}} returns results, 
> but listing {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} 
> in S3A due to ListObjects inconsistency).
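>
> To make the two rounds of listing concrete, here is a hedged sketch using the 
> Hadoop FileSystem API (the bucket, paths, and setup are illustrative):
> {code:scala}
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> // `spark` is an active SparkSession.
> val root = new Path("s3a://bucket/data/date=1/")
> val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
>
> // Round 1: discover partition directories (region=west, region=east).
> val partitions = fs.listStatus(root).filter(_.isDirectory).map(_.getPath)
>
> // Round 2: list leaf files within each partition. Under eventually
> // consistent S3 ListObjects, this call can throw FileNotFoundException
> // for a directory that round 1 just reported.
> val leafFiles = partitions.flatMap(p => fs.listStatus(p))
> {code}
>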
> If Spark propagated the {{FileNotFoundException}} and hard-failed here, I'd 
> be able to fail the job in the case where we _definitely_ know that the S3 
> listing is inconsistent. Failing here doesn't guard against _all_ potential 
> S3 list-inconsistency issues (e.g. back-to-back listings which both return a 
> subset of the true set of objects), but I think it's still an improvement to 
> fail for the subset of cases that we _can_ detect, even if that's not a 
> surefire failsafe against the more general problem.
>
> Finally, I'm unsure whether the original patch will have the desired effect: 
> if a file is deleted after a Spark job has resolved it for reading, that can 
> cause problems at multiple layers, both in the driver (during the multiple 
> rounds of file listing) and in executors (if the deletion occurs after the 
> construction of the catalog but before the scheduling of the read tasks). I 
> think the original patch only resolved the problem for the driver (unless I'm 
> missing similar executor-side code specific to the original streaming use 
> case).
>
> Given all of these reasons, I think that the "ignore potentially deleted 
> files during file index listing" behavior should be guarded behind a feature 
> flag which defaults to {{false}}, consistent with the existing 
> {{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} 
> flags (which both default to false).
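>
> For reference, a minimal sketch setting both of those existing flags (the 
> names and {{false}} defaults are as described above; the session setup is 
> illustrative):
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   // Core/RDD-level flag.
>   .config("spark.files.ignoreMissingFiles", "true")
>   .getOrCreate()
>
> // SQL-level flag, settable per session.
> spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")
> {code}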


