[ 
https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-29037:
----------------------------
    Description: 
Consider an INSERT OVERWRITE into a table partition.
In the stage whose tasks commit output, each task first writes its output to a
staging dir; when the task completes, it moves that output to its
committedTaskPath; and when all tasks of the stage succeed, all task output
under the committedTaskPath dirs is moved to the destination dir.

However, when an application is killed while it is committing tasks' output,
some tasks' results remain under committedTaskPath and are not cleaned up
gracefully.

When we rerun the application, the new application reuses the same
committedTaskPath dir.

When the task-commit stage of the new application succeeds, all task output
under that committedTaskPath, including the leftover output of the old
application's tasks, is moved to the destination dir, so the result is
duplicated.
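The flow above can be sketched with a minimal, self-contained simulation. This is not Spark's actual committer code; the directory names and helper functions are illustrative only. It shows how leftover committed-task output from a killed run gets swept into the destination by the rerun's job commit:

```python
import os
import shutil
import tempfile

# Sketch of a two-phase commit protocol (hypothetical helpers, not Spark's
# FileOutputCommitter): a task writes to a per-attempt dir, "commits" by
# renaming it to a committed-task path, and job commit moves everything
# under the committed-task paths to the destination dir.

def run_and_commit_task(staging, task_id, data):
    attempt = os.path.join(staging, "_temporary", f"attempt_{task_id}")
    os.makedirs(attempt, exist_ok=True)
    with open(os.path.join(attempt, f"part-{task_id}"), "w") as f:
        f.write(data)
    # Task commit: rename the attempt dir to the committed-task path.
    committed = os.path.join(staging, "_temporary", f"task_{task_id}")
    os.rename(attempt, committed)

def commit_job(staging, dest):
    # Job commit: move every file under every committed-task path to dest.
    os.makedirs(dest, exist_ok=True)
    tmp = os.path.join(staging, "_temporary")
    for task_dir in sorted(os.listdir(tmp)):
        if task_dir.startswith("task_"):
            for name in os.listdir(os.path.join(tmp, task_dir)):
                shutil.move(os.path.join(tmp, task_dir, name), dest)
    shutil.rmtree(tmp)

staging = tempfile.mkdtemp()
dest = tempfile.mkdtemp()

# First run: one task commits, then the application is killed before job
# commit, so its output survives under staging/_temporary/task_0.
run_and_commit_task(staging, "0", "old-run")

# Rerun reuses the same staging dir: its own task commits, and job commit
# then moves BOTH runs' committed output to the destination.
run_and_commit_task(staging, "0b", "new-run")
commit_job(staging, dest)
print(sorted(os.listdir(dest)))  # output from both runs: duplicated result
```

Under these assumptions, the fix direction suggested by the report is that a new application run should clear (or not reuse) stale committed-task paths before its own job commit.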



  was:
Consider an INSERT OVERWRITE into a table partition.
In the stage whose tasks commit output, each task first writes its output to a
staging dir; when all tasks of the stage succeed, all task output under the
staging dir is moved to the destination dir.

However, when an application is killed while it is committing tasks' output,
some tasks' results remain in the staging dir and are not cleaned up
gracefully.

When we rerun the application, the new application reuses the same staging
dir.

When the task-commit stage of the new application succeeds, all task output
under that staging dir, including the leftover output of the old application's
tasks, is moved to the destination dir, so the result is duplicated.




> [Core] Spark gives duplicate result when an application was killed and rerun
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-29037
>                 URL: https://issues.apache.org/jira/browse/SPARK-29037
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.3.3
>            Reporter: feiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
