[
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xuqing updated SPARK-18113:
---------------------------
Description:
Executor sends *AskPermissionToCommitOutput* to driver, driver receives this
message and handles it. But executor does not get response in
*spark.rpc.askTimeout*. The TaskAttemptNumber for this partition in
authorizedCommittersByStage is locked forever. Driver enters into infinite loop.
h4. Driver Log:
16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID
110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110,
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2,
partition: 24, attemptNumber: 0
...
16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing,
stage: 2, partition: 24, attempt: 0
...
16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID
28732, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 28732,
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2,
partition: 24, attemptNumber: 1
16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing,
stage: 2, partition: 24, attempt: 1
...
16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID
28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
h4. Executor Log:
...
16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
...
16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message =
AskPermissionToCommitOutput(2,24,0)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds].
This timeout is controlled by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at
org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
at
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 13 more
...
16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
...
16/10/25 05:39:24 INFO SparkHadoopMapRedUtil:
attempt_201610250536_0002_m_000024_119: Not committed because the driver did
not authorize commit
...
was:
Executor sends *AskPermissionToCommitOutput* to driver, and driver receive this
message and handle it. But executor does not get response in
*spark.rpc.askTimeout*. The TaskAttemptNumber for this partition in
authorizedCommittersByStage is locked forever. The driver enters into infinite
loop.
h4. Driver Log:
16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID
110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110,
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2,
partition: 24, attemptNumber: 0
...
16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing,
stage: 2, partition: 24, attempt: 0
...
16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID
28732, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 28732,
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2,
partition: 24, attemptNumber: 1
16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing,
stage: 2, partition: 24, attempt: 1
...
16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID
28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
h4. Executor Log:
...
16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
...
16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message =
AskPermissionToCommitOutput(2,24,0)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds].
This timeout is controlled by spark.rpc.askTimeout
at
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at
org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
at
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
at
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 13 more
...
16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
...
16/10/25 05:39:24 INFO SparkHadoopMapRedUtil:
attempt_201610250536_0002_m_000024_119: Not committed because the driver did
not authorize commit
...
> AskPermissionToCommitOutput response lost, driver enter into task deadloop
> --------------------------------------------------------------------------
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
> Reporter: xuqing
> Priority: Minor
>
> Executor sends *AskPermissionToCommitOutput* to driver, driver receives this
> message and handles it. But executor does not get response in
> *spark.rpc.askTimeout*. The TaskAttemptNumber for this partition in
> authorizedCommittersByStage is locked forever. Driver enters into infinite
> loop.
> h4. Driver Log:
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110,
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2,
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing,
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID
> 28732, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID
> 28732, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for
> job: 2, partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing,
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> h4. Executor Log:
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message =
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10
> seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:190)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
> ... 13 more
> ...
> 16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
> ...
> 16/10/25 05:39:24 INFO SparkHadoopMapRedUtil:
> attempt_201610250536_0002_m_000024_119: Not committed because the driver did
> not authorize commit
> ...
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]