[
https://issues.apache.org/jira/browse/SPARK-22902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Keith Sun updated SPARK-22902:
------------------------------
Priority: Minor (was: Major)
> do not count failures if the speculative task is killed as the same task
> finished in other executor
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-22902
> URL: https://issues.apache.org/jira/browse/SPARK-22902
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Affects Versions: 2.1.1
> Reporter: Keith Sun
> Priority: Minor
>
> This is a logic issue, so I have not included much about the environment,
> only my log, in this ticket.
> Spark conf related to this issue:
> spark.task.maxFailures=2
> spark.speculation=true
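> These can also be set programmatically; a minimal sketch of one way to
> reproduce the setup (the app name here is arbitrary):
> {noformat}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
>
> val conf = new SparkConf()
>   .set("spark.task.maxFailures", "2") // abort the job once a task fails twice
>   .set("spark.speculation", "true")   // re-launch slow tasks speculatively
>
> val spark = SparkSession.builder()
>   .appName("speculation-repro") // hypothetical app name
>   .config(conf)
>   .getOrCreate()
> {noformat}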
> My case is this: task 239 first failed on one executor and was restarted on
> another executor. Because that attempt was running slowly, Spark started a
> speculative copy of the task, as we had enabled speculative execution.
> Shortly afterwards, the second attempt finished and Spark killed the
> speculative one.
> But this caused the whole Spark job to abort, because the task's failure
> count reached 2 (the first failure, due to an unrelated issue, plus the
> killed speculative attempt).
> This is confusing, as task 239 actually finished successfully and the
> speculative attempt was killed rather than failing on its own.
> Shall we ignore a speculative attempt's failure when it is caused by an
> active kill?
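> A minimal sketch of the proposed behavior, using a simplified, hypothetical
> tracker rather than Spark's actual TaskSetManager internals: a kill issued
> because another attempt already succeeded is excluded from the failure count.
> {noformat}
> // Hypothetical end reasons; only genuine failures should count
> // toward spark.task.maxFailures.
> sealed trait AttemptEndReason
> case object GenuineFailure extends AttemptEndReason
> case object KilledBecauseOtherAttemptSucceeded extends AttemptEndReason
>
> class SimpleTaskTracker(maxFailures: Int) {
>   private var numFailures = 0
>
>   def handleFailedAttempt(taskId: Long, reason: AttemptEndReason): Unit =
>     reason match {
>       case KilledBecauseOtherAttemptSucceeded =>
>         // The task already succeeded elsewhere; the kill of this
>         // speculative copy is not a real failure, so do not count it.
>         ()
>       case GenuineFailure =>
>         numFailures += 1
>         if (numFailures >= maxFailures) {
>           throw new IllegalStateException(
>             s"Task $taskId failed $numFailures times; aborting job")
>         }
>     }
> }
> {noformat}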
> In the Spark configuration doc, I found this explanation of
> spark.task.maxFailures:
> {noformat}
> Number of failures of any particular task before giving up on the job. The
> total number of failures spread across different tasks will not cause the job
> to fail; a particular task has to fail this number of attempts. Should be
> greater than or equal to 1. Number of allowed retries = this value - 1.
> {noformat}
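> With spark.task.maxFailures=2, the counting in my case works out as below
> (a worked trace matching the log; it assumes the killed speculative attempt
> is counted, which is exactly the behavior in question):
> {noformat}
> attempt 239.0 (TID 10254): genuine failure            -> failure count = 1
> attempt 239.2 (TID 15142): killed speculative attempt -> failure count = 2
> failure count (2) >= spark.task.maxFailures (2)       -> job aborted
> {noformat}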
> My log :
> {noformat}
> 17/12/25 12:25:02 INFO TaskSetManager: Starting task 239.0 in stage 1.0 (TID
> 10254, host-620-1507-026.lvs02xxxx, executor 208, partition 239,
> PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:36:18 INFO TaskSetManager:
> Lost task 239.0 in stage 1.0 (TID 10254) on host-620-1507-026.lvs02xxxx,
> executor 208: org.apache.spark.SparkException (Task failed while writing
> rows) [duplicate 1]
> 17/12/25 12:36:18 INFO TaskSetManager: Starting task 239.1 in stage 1.0 (TID
> 10601, host-620-1507-038.lvs01xxxx, executor 343, partition 239,
> PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:39:19 INFO TaskSetManager: Marking task 239 in stage 1.0 (on
> host-620-1507-038.lvs01xxxx) as speculatable because it ran more than 45608 ms
> 17/12/25 12:39:19 INFO TaskSetManager: Starting task 239.2 in stage 1.0 (TID
> 15142, host-620-1507-030.lvs03xxxx, executor 361, partition 239,
> PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:39:22 INFO TaskSetManager: Killing attempt 2 for task 239.2 in
> stage 1.0 (TID 15142) on host-620-1507-030.lvs03xxxx as the attempt 1
> succeeded on host-620-1507-038.lvs01xxxx
> 17/12/25 12:39:22 INFO TaskSetManager: Finished task 239.1 in stage 1.0 (TID
> 10601) in 183663 ms on host-620-1507-038.lvs01xxxx (executor 343) (4606/5000)
> 17/12/25 12:39:28 INFO TaskSetManager: Task 239.2 in stage 1.0 (TID 15142)
> failed, but another instance of the task has already succeeded, so not
> re-queuing the task to be re-executed.
> 17/12/25 12:39:28 ERROR TaskSetManager: Task 239 in stage 1.0 failed 2 times;
> aborting job
> 17/12/25 12:39:28 INFO YarnClusterScheduler: Cancelling stage 1
> 17/12/25 12:39:28 INFO YarnClusterScheduler: Stage 1 was cancelled
> 17/12/25 12:39:28 INFO DAGScheduler: ResultStage 1 (sql at
> SparkStatement.scala:61) failed in 865.935 s due to Job aborted due to stage
> failure: Task 239 in stage 1.0 failed 2 times, most recent failure: Lost task
> 239.2 in stage 1.0 (TID 15142, host-620-1507-030.lvs03xxxx, executor 361):
> org.apache.spark.SparkException: Task failed while writing rows
> {noformat}