[jira] [Updated] (SPARK-22902) do not count failures if the speculative task is killed as the same task finished in other executor

Keith Sun (JIRA) Tue, 26 Dec 2017 01:52:53 -0800

     [ https://issues.apache.org/jira/browse/SPARK-22902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Keith Sun updated SPARK-22902:
------------------------------
    Description: 
This is a logic issue, so I have not included much environment detail beyond 
my log.

Spark configuration related to this issue:
spark.task.maxFailures=2
spark.speculation=true 

My case: task 239 first failed on one executor and was restarted on another 
executor. Because that retry was running slowly, Spark launched a speculative 
attempt of the same task, since speculative execution is enabled.
Shortly afterwards, the retry (the second attempt) finished and Spark killed 
the speculative attempt.
But this caused the whole Spark job to abort, because the failure count for 
the task reached 2 (the first failure, due to an unrelated issue, plus the 
killed speculative attempt).

This is confusing, because task 239 actually finished successfully and the 
speculative attempt was killed rather than failing by itself.

Should we ignore failures of a speculative attempt that are caused by a 
deliberate kill?
In the Spark configuration documentation, I found this explanation of 
spark.task.maxFailures:

{noformat}
Number of failures of any particular task before giving up on the job. The 
total number of failures spread across different tasks will not cause the job 
to fail; a particular task has to fail this number of attempts. Should be 
greater than or equal to 1. Number of allowed retries = this value - 1.
{noformat}
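
To make the counting question concrete, here is a minimal Python sketch of the two rules being compared. It is only a model of the behaviour described in this ticket, not Spark's actual TaskSetManager code; the function name job_aborts and the flag ignore_failures_after_success are hypothetical names chosen for illustration.

```python
def job_aborts(outcomes, max_failures, ignore_failures_after_success):
    """Return True if the modelled job would abort for one task.

    outcomes: attempt results for a single task, in the order the
        scheduler learns about them; each is "failed" or "succeeded".
    max_failures: modelled value of spark.task.maxFailures.
    ignore_failures_after_success: hypothetical switch for the rule
        proposed in this ticket (skip failures reported after some
        attempt of the same task has already succeeded).
    """
    failures = 0
    succeeded = False
    for outcome in outcomes:
        if outcome == "succeeded":
            succeeded = True
        elif outcome == "failed":
            if succeeded and ignore_failures_after_success:
                continue  # proposed rule: do not count this failure
            failures += 1
            if failures >= max_failures:
                return True
    return False


# Sequence from this ticket: attempt 0 fails, attempt 1 succeeds, and the
# killed speculative attempt 2 is then reported as a failure.
reported = ["failed", "succeeded", "failed"]
print(job_aborts(reported, 2, ignore_failures_after_success=False))
print(job_aborts(reported, 2, ignore_failures_after_success=True))
```

With spark.task.maxFailures=2 the first call models what I observed (the job aborts), while the second models the behaviour I am asking for (the late failure is not counted).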


My log :

{noformat}
17/12/25 12:25:02 INFO TaskSetManager: Starting task 239.0 in stage 1.0 (TID 10254, host-620-1507-026.lvs02xxxx, executor 208, partition 239, PROCESS_LOCAL, 5910 bytes)
17/12/25 12:36:18 INFO TaskSetManager: Lost task 239.0 in stage 1.0 (TID 10254) on host-620-1507-026.lvs02xxxx, executor 208: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 1]
17/12/25 12:36:18 INFO TaskSetManager: Starting task 239.1 in stage 1.0 (TID 10601, host-620-1507-038.lvs01xxxx, executor 343, partition 239, PROCESS_LOCAL, 5910 bytes)

17/12/25 12:39:19 INFO TaskSetManager: Marking task 239 in stage 1.0 (on host-620-1507-038.lvs01xxxx) as speculatable because it ran more than 45608 ms
17/12/25 12:39:19 INFO TaskSetManager: Starting task 239.2 in stage 1.0 (TID 15142, host-620-1507-030.lvs03xxxx, executor 361, partition 239, PROCESS_LOCAL, 5910 bytes)
17/12/25 12:39:22 INFO TaskSetManager: Killing attempt 2 for task 239.2 in stage 1.0 (TID 15142) on host-620-1507-030.lvs03xxxx as the attempt 1 succeeded on host-620-1507-038.lvs01xxxx
17/12/25 12:39:22 INFO TaskSetManager: Finished task 239.1 in stage 1.0 (TID 10601) in 183663 ms on host-620-1507-038.lvs01xxxx (executor 343) (4606/5000)

17/12/25 12:39:28 INFO TaskSetManager: Task 239.2 in stage 1.0 (TID 15142) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
17/12/25 12:39:28 ERROR TaskSetManager: Task 239 in stage 1.0 failed 2 times; aborting job
17/12/25 12:39:28 INFO YarnClusterScheduler: Cancelling stage 1
17/12/25 12:39:28 INFO YarnClusterScheduler: Stage 1 was cancelled
17/12/25 12:39:28 INFO DAGScheduler: ResultStage 1 (sql at SparkStatement.scala:61) failed in 865.935 s due to Job aborted due to stage failure: Task 239 in stage 1.0 failed 2 times, most recent failure: Lost task 239.2 in stage 1.0 (TID 15142, host-620-1507-030.lvs03xxxx, executor 361): org.apache.spark.SparkException: Task failed while writing rows

{noformat}
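
For anyone who wants to check the order of events, the following Python sketch pulls the per-attempt timeline out of the TaskSetManager lines in my log. The regular expressions are my own guesses based only on the messages shown in this ticket, and the names PATTERNS and timeline are hypothetical; other Spark versions may word these messages differently.

```python
import re

# TaskSetManager lines copied from the log in this ticket, one entry per line.
LOG = """\
17/12/25 12:25:02 INFO TaskSetManager: Starting task 239.0 in stage 1.0 (TID 10254, host-620-1507-026.lvs02xxxx, executor 208, partition 239, PROCESS_LOCAL, 5910 bytes)
17/12/25 12:36:18 INFO TaskSetManager: Lost task 239.0 in stage 1.0 (TID 10254) on host-620-1507-026.lvs02xxxx, executor 208: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 1]
17/12/25 12:36:18 INFO TaskSetManager: Starting task 239.1 in stage 1.0 (TID 10601, host-620-1507-038.lvs01xxxx, executor 343, partition 239, PROCESS_LOCAL, 5910 bytes)
17/12/25 12:39:19 INFO TaskSetManager: Starting task 239.2 in stage 1.0 (TID 15142, host-620-1507-030.lvs03xxxx, executor 361, partition 239, PROCESS_LOCAL, 5910 bytes)
17/12/25 12:39:22 INFO TaskSetManager: Killing attempt 2 for task 239.2 in stage 1.0 (TID 15142) on host-620-1507-030.lvs03xxxx as the attempt 1 succeeded on host-620-1507-038.lvs01xxxx
17/12/25 12:39:22 INFO TaskSetManager: Finished task 239.1 in stage 1.0 (TID 10601) in 183663 ms on host-620-1507-038.lvs01xxxx (executor 343) (4606/5000)
17/12/25 12:39:28 INFO TaskSetManager: Task 239.2 in stage 1.0 (TID 15142) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
"""

# (event name, pattern); group 1 is the task index, group 2 the attempt number.
PATTERNS = [
    ("started", re.compile(r"Starting task (\d+)\.(\d+) ")),
    ("lost", re.compile(r"Lost task (\d+)\.(\d+) ")),
    ("kill requested", re.compile(r"Killing attempt \d+ for task (\d+)\.(\d+) ")),
    ("finished", re.compile(r"Finished task (\d+)\.(\d+) ")),
    ("failed after success", re.compile(
        r"Task (\d+)\.(\d+) in stage \S+ \(TID \d+\) failed, but another instance")),
]


def timeline(log_text):
    """Return (time, attempt number, event name) tuples in log order."""
    events = []
    for line in log_text.splitlines():
        parts = line.split(" ", 2)  # date, time, rest of the line
        if len(parts) < 3:
            continue
        for name, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((parts[1], int(match.group(2)), name))
                break
    return events


for time, attempt, event in timeline(LOG):
    print(f"{time}  attempt {attempt}: {event}")
```

The point of interest is that the "failed after success" event for attempt 2 arrives at 12:39:28, after attempt 1 was already reported as finished at 12:39:22.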





> do not count failures if the speculative task is killed as the same task 
> finished in other executor
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22902
>                 URL: https://issues.apache.org/jira/browse/SPARK-22902
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.1.1
>            Reporter: Keith Sun


