Josh Rosen created SPARK-8132:
---------------------------------

             Summary: Race condition if task is cancelled with interruption while fetching file dependencies
                 Key: SPARK-8132
                 URL: https://issues.apache.org/jira/browse/SPARK-8132
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.3.1, 1.4.0
            Reporter: Josh Rosen


This is a borderline impossible-to-reproduce bug:

If {{spark.files.overwrite = false}} (the default) and a Spark executor is 
fetching large file dependencies from the driver _and_ the first task that 
triggered file dependency loading is cancelled after it has started copying / 
moving the downloaded file to its target directory, then the executor may be 
put into a bad state where all subsequent tasks fail with errors about refusing 
to overwrite an existing file because its contents differ from the file being 
fetched.
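
To make the failure mode concrete, here is a deliberately simplified model of it. This is not Spark's actual file-fetching code: {{fetch_file}}, {{CopyInterrupted}}, the {{interrupt_after}} parameter and the file names are all hypothetical, and the model assumes the only thing that matters is that an interrupted copy leaves a truncated file at the target path.

```python
import os
import tempfile


class CopyInterrupted(Exception):
    """Hypothetical stand-in for a task being cancelled mid-copy."""


def fetch_file(source, target, overwrite=False, interrupt_after=None):
    """Simplified model of copying a fetched file to its target path.

    If `interrupt_after` is set, the copy is aborted after that many bytes,
    leaving a truncated file behind, as an interrupted task might.
    """
    with open(source, "rb") as f:
        data = f.read()

    if os.path.exists(target):
        with open(target, "rb") as f:
            existing = f.read()
        if existing == data:
            return  # already fetched, nothing to do
        if not overwrite:
            # Models the "refusing to overwrite" error seen by later tasks.
            raise RuntimeError(
                f"{target} exists and does not match contents of {source}"
            )
        os.remove(target)

    with open(target, "wb") as out:
        if interrupt_after is not None:
            out.write(data[:interrupt_after])
            raise CopyInterrupted("cancelled while copying")
        out.write(data)


def demo():
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "dep.jar")
    target = os.path.join(workdir, "target-dep.jar")
    with open(source, "wb") as f:
        f.write(b"x" * 1024)

    outcomes = []
    try:
        # First task: cancelled after the copy has started.
        fetch_file(source, target, interrupt_after=10)
    except CopyInterrupted:
        outcomes.append("interrupted")
    try:
        # Later task: finds the truncated file and refuses to overwrite it.
        fetch_file(source, target)
    except RuntimeError:
        outcomes.append("refused")
    # With overwriting allowed, the stale partial file is replaced.
    fetch_file(source, target, overwrite=True)
    outcomes.append("overwritten")
    return outcomes


if __name__ == "__main__":
    print(demo())
```

In this model every call after the interrupted one fails until overwriting is allowed, which mirrors the "all subsequent tasks fail" behaviour described above.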

There are a few ways to mitigate this:

- Set {{spark.files.overwrite = true}}.  We should probably remove or 
deprecate this configuration: the only reason that it was added was to work 
around an obscure Spark 0.8-era bug where Spark would delete files out of the 
driver's CWD when running tasks in local mode.  This concern may have been 
mitigated by other changes.  Regardless, there are many environments where the 
overwrite protection can safely be disabled.
- Disable {{spark.files.useFetchCache}}, which should probably be off by 
default (see SPARK-8130); this will shorten the window over which the race can 
occur.
- Catch InterruptedException and perform cleanup in our file moving / copying 
code; this is somewhat tricky to reason about / get right because the right 
behavior differs based on whether we're overwriting or creating a new file.
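
As a rough illustration of the last option, the sketch below shows one common way to avoid leaving a partial file behind: copy into a temporary file in the target directory, rename it into place only once the copy has completed, and delete the temporary file if anything interrupts the copy. It is a generic pattern rather than a proposed patch; {{copy_with_cleanup}} and its {{copy}} parameter are hypothetical names. Note that it always replaces the target, so the non-overwrite case would still need its own existence and content check.

```python
import os
import shutil
import tempfile


def copy_with_cleanup(source, target, copy=shutil.copyfileobj):
    """Copy `source` to `target` without ever leaving `target` half-written.

    `copy` is a hypothetical hook so that an interruption can be simulated;
    it receives the open source and destination file objects.
    """
    target_dir = os.path.dirname(os.path.abspath(target))
    # The temporary file lives in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as out, open(source, "rb") as src:
            copy(src, out)
        # Only a fully written file is ever moved to the target path.
        os.replace(tmp_path, target)
    except BaseException:
        # BaseException also covers KeyboardInterrupt, the closest Python
        # analogue to an interrupted task in this sketch.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this shape, an interruption during the copy leaves at most a temporary file that is removed before the exception propagates, so a later task never sees a truncated file at the target path.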

Given that this can be fixed with conf changes for the cases that I've seen, 
I'm not sure that this needs to be a high-priority fix, although I would be 
glad to review patches to clean up / audit this code to properly fix this issue.


