[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865359#comment-15865359 ]

xukun edited comment on SPARK-18113 at 2/14/17 8:34 AM:
--------------------------------------------------------

[~jinxing6...@126.com]
Yes. Reproduction steps:

1. The executor is running task x.0.
2. The ApplicationMaster gets the completed container statuses and tells the 
driver to remove the executor (at this point the executor is not yet killed; 
it is still running).
3. The driver processes the CompletionEvent, and TaskSetManager calls:
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
4. Task x.0 requests a task commit; the coordinator approves the commit and 
sends the response message.
5. The driver receives AskPermissionToCommitOutput and authorizes 
attemptNumber=0 for partition x.
6. The driver starts task x.1 because of ExecutorLost and enters a task 
deadloop.

When the ApplicationMaster gets the completed container statuses, the executor 
is not preempted immediately; it is killed only after a while. So after 
TaskSetManager calls
{code}
sched.dagScheduler.taskEnded(
  tasks(index), Resubmitted, null, Seq.empty, info)
{code}
task x.0 will still send AskPermissionToCommitOutput, and the driver enters a 
task deadloop.
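The denial in step 6 can be sketched as a tiny model of the pre-fix commit-authorization bookkeeping (a simplified illustration only, not Spark's actual OutputCommitCoordinator; the names here are approximate):

```scala
// Simplified model of the pre-fix commit-authorization bookkeeping
// (illustrative only; not Spark's actual OutputCommitCoordinator).
object CommitCoordinatorSketch {
  // partition -> attemptNumber currently holding commit authorization
  private val authorizedCommitters = scala.collection.mutable.Map[Int, Int]()

  // Driver-side handling of AskPermissionToCommitOutput.
  def canCommit(partition: Int, attemptNumber: Int): Boolean =
    authorizedCommitters.get(partition) match {
      case None =>
        // First asker wins: attemptNumber=0 is authorized in step 5.
        authorizedCommitters(partition) = attemptNumber
        true
      case Some(_) =>
        // Any later ask is denied, and the Resubmitted path never clears
        // the entry -- so partition x can never be committed again.
        false
    }

  def main(args: Array[String]): Unit = {
    println(canCommit(24, 0)) // true  (but the executor is being removed)
    println(canCommit(24, 1)) // false (task x.1 denied -> resubmitted -> deadloop)
  }
}
```

Because nothing ever removes the entry for partition x, every resubmitted attempt x.1, x.2, ... is denied and resubmitted again.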





> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-18113
>                 URL: https://issues.apache.org/jira/browse/SPARK-18113
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.0.1
>         Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>            Reporter: xuqing
>            Assignee: jin xing
>             Fix For: 2.2.0
>
>
> Executor sends *AskPermissionToCommitOutput* to the driver, but the send 
> fails and the executor retries. The driver receives 2 
> AskPermissionToCommitOutput messages and handles both, but the executor 
> ignores the first response (true) and only sees the second response (false). 
> The TaskAttemptNumber for this partition in authorizedCommittersByStage is 
> then locked forever, and the driver enters an infinite loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
>         at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>         at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>         at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>         at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
>         at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
>         at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
>         at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
>         at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
>         at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>         at org.apache.spark.scheduler.Task.run(Task.scala:86)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>         at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>         at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
>         ... 13 more
> ...
> 16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
> ...
> 16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: 
> attempt_201610250536_0002_m_000024_119: Not committed because the driver did 
> not authorize commit
> ...
> {noformat}
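The executor-side half of the failure in the log above (the first ask times out, its positive reply is dropped, and only the retry's negative reply is seen) can be sketched like this (an illustration only; `askWithRetry` here is a stand-in, not Spark's NettyRpcEndpointRef):

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object AskRetrySketch {
  // Ask-with-retry as in the log above: a reply that misses the timeout
  // window is dropped, and only a later attempt's reply is returned.
  def askWithRetry(attempts: Int, timeout: FiniteDuration)
                  (ask: Int => Future[Boolean]): Boolean = {
    var lastError: Throwable = null
    var attempt = 1
    while (attempt <= attempts) {
      try {
        return Await.result(ask(attempt), timeout)
      } catch {
        // Corresponds to "Error sending message ... in N attempts" above.
        case e: TimeoutException => lastError = e
      }
      attempt += 1
    }
    throw lastError
  }

  def main(args: Array[String]): Unit = {
    // Attempt 1: the (true) reply arrives after the timeout and is dropped.
    // Attempt 2: the driver has already authorized attempt 0, so it answers
    // false -- the executor only ever sees TaskCommitDenied.
    val answer = askWithRetry(attempts = 3, timeout = 100.millis) { attempt =>
      if (attempt == 1) Future { Thread.sleep(500); true }
      else Future.successful(false)
    }
    println(answer) // false
  }
}
```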



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
