[ 
https://issues.apache.org/jira/browse/SPARK-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8513:
------------------------------------
    Target Version/s: 1.6.0  (was: 1.5.0)

> _temporary may be left undeleted when a write job committed with 
> FileOutputCommitter fails due to a race condition
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-8513
>                 URL: https://issues.apache.org/jira/browse/SPARK-8513
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.2.2, 1.3.1, 1.4.0
>            Reporter: Cheng Lian
>
> To reproduce this issue, we need a node with a relatively large number of 
> cores, say 32 (e.g., the Spark Jenkins builder is a good candidate).  On such 
> a node, the following code reproduces the issue fairly reliably:
> {code}
> sqlContext.range(0, 10).repartition(32).select('id / 0).write.mode("overwrite").parquet("file:///tmp/foo")
> {code}
> You may observe similar log lines as below:
> {noformat}
> 01:58:27.682 pool-1-thread-1-ScalaTest-running-CommitFailureTestRelationSuite WARN FileUtil: Failed to delete file or dir [/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-a918b285-fa59-4a29-857e-a95e38fa355a/_temporary/0/_temporary]: it still exists.
> {noformat}
> The reason is that, for a Spark job with multiple tasks, when a task fails 
> after multiple retries, the job gets canceled on the driver side.  At the same 
> time, all child tasks of this job also get canceled.  However, task 
> cancellation is asynchronous.  This means some tasks may still be running 
> after the job has already been killed on the driver side.
> With this in mind, the following execution order may cause the log line 
> mentioned above:
> # Job {{A}} spawns 32 tasks to write the Parquet file
>   Since {{ParquetOutputCommitter}} is a subclass of {{FileOutputCommitter}}, a 
> temporary directory {{D1}} is created to hold output files of different task 
> attempts.
> # Task {{a1}} is the first to fail, after several retries, because of the 
> division-by-zero error
> # Task {{a1}} aborts the Parquet write task and tries to remove its task 
> attempt output directory {{d1}} (a sub-directory of {{D1}})
> # Job {{A}} gets canceled on the driver side, and all the other 31 tasks also 
> get canceled *asynchronously*
> # {{ParquetOutputCommitter.abortJob()}} tries to remove {{D1}} by first 
> removing all of its child files/directories
>   Note that when testing with a local directory, {{RawLocalFileSystem}} simply 
> calls {{java.io.File.delete()}} to perform the deletion, which only removes 
> files and empty directories.
> # Because tasks are canceled asynchronously, some other task, say {{a2}}, may 
> just get scheduled and create its own task attempt directory {{d2}} under 
> {{D1}}
> # Now {{ParquetOutputCommitter.abortJob()}} tries to finally remove {{D1}} 
> itself, but fails because {{d2}} makes {{D1}} non-empty again
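> The local-filesystem behavior driving the last two steps can be demonstrated 
> directly.  The sketch below (class and directory names are made up for 
> illustration) shows why a concurrently created child directory makes the 
> final delete of {{D1}} fail:
{code:java}
import java.io.File;
import java.nio.file.Files;

public class DeleteRaceDemo {
    public static void main(String[] args) throws Exception {
        // D1: stand-in for the job-level temporary directory (_temporary/0)
        File d1 = Files.createTempDirectory("D1").toFile();

        // Simulate a straggler task attempt creating its directory d2 under D1
        // after abortJob() has already removed the children it knew about.
        File d2 = new File(d1, "attempt_d2");
        d2.mkdir();

        // java.io.File.delete() refuses to remove a non-empty directory, so
        // abortJob()'s final delete of D1 fails and _temporary survives.
        boolean deletedWhileNonEmpty = d1.delete();
        System.out.println("delete non-empty D1: " + deletedWhileNonEmpty); // false

        // Once the straggler's directory is gone, the delete succeeds.
        d2.delete();
        boolean deletedWhenEmpty = d1.delete();
        System.out.println("delete empty D1: " + deletedWhenEmpty); // true
    }
}
{code}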
> Notice that this bug affects all Spark jobs that write files with 
> {{FileOutputCommitter}} or any of its subclasses that create and delete 
> temporary directories.
> One possible way to fix this issue is to make task cancellation synchronous, 
> but this would also increase job cancellation latency.
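> For comparison, a defensive mitigation on the committer side (only an 
> illustrative sketch, not Spark's or Hadoop's actual code) would be to retry 
> the recursive delete a bounded number of times, so that a directory 
> re-populated by a straggler task attempt gets swept again on the next pass.  
> This narrows the race window rather than closing it, which is why it is only 
> a mitigation:
{code:java}
import java.io.File;

public class RetryingDelete {
    // Recursively delete dir, retrying a bounded number of times so that
    // directories re-populated by straggler task attempts are swept again.
    static boolean deleteWithRetries(File dir, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            deleteRecursively(dir);
            if (!dir.exists()) {
                return true;
            }
            try {
                Thread.sleep(100L); // give canceled tasks some time to die
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return !dir.exists();
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete(); // only removes files and empty directories
    }
}
{code}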



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
