KaiXu created SPARK-19528:
-----------------------------
Summary: external shuffle service closes while it still has
requests from an executor when dynamic allocation is enabled
Key: SPARK-19528
URL: https://issues.apache.org/jira/browse/SPARK-19528
Project: Spark
Issue Type: Bug
Components: Block Manager, Shuffle, Spark Core
Affects Versions: 1.6.2
Environment: Hadoop 2.7.1, Spark 1.6.2, Hive 2.2
Reporter: KaiXu
When dynamic allocation is enabled, the external shuffle service is used to
maintain the unfinished state between executors. The external shuffle service
should therefore not close before the executor does while it still has
outstanding requests from that executor.
container's log:
17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to
driver: spark://[email protected]:41867
17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully
registered with driver
17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host
hsx-node8
17/02/09 08:30:46 INFO util.Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374.
17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 40374
17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port =
7337
17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register
BlockManager
17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager
17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local
external shuffle service.
17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 requests
outstanding when connection from hsx-node8/192.168.1.8:7337 is closed
17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external
shuffle server, will retry 2 more times after waiting 5 seconds...
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201)
at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274)
... 14 more
17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external
shuffle server, will retry 1 more times after waiting 5 seconds...
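For readers unfamiliar with the retry messages above, the following is a minimal sketch of the pattern the log suggests: a bounded number of registration attempts with a fixed wait between failures. It is not Spark's actual implementation; the class name RegistrationRetry, the method registerWithRetry, and its parameters are hypothetical names chosen for illustration, and the simulated timeout stands in for the real RPC to the shuffle service.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not taken from the Spark code base.
public class RegistrationRetry {

    /**
     * Runs the given attempt up to maxAttempts times, sleeping sleepMillis
     * between failed attempts. Returns the 1-based number of the attempt
     * that succeeded, or rethrows the last failure once attempts run out.
     */
    public static int registerWithRetry(Callable<Void> attempt,
                                        int maxAttempts,
                                        long sleepMillis) throws Exception {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.call();
                return i;
            } catch (Exception e) {
                if (i == maxAttempts) {
                    throw e; // no attempts left: surface the failure
                }
                System.err.println("Failed to connect, will retry "
                        + (maxAttempts - i) + " more times after waiting "
                        + sleepMillis + " ms");
                Thread.sleep(sleepMillis);
            }
        }
        throw new IllegalArgumentException("maxAttempts must be >= 1");
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated registration: times out twice, then succeeds.
        int succeededOn = registerWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TimeoutException("Timeout waiting for task.");
            }
            return null;
        }, 3, 10);
        System.out.println("registered on attempt " + succeededOn);
        // prints "registered on attempt 3"
    }
}
```

With three attempts and a 5-second wait, as the log messages indicate, an executor that never reaches the shuffle service would give up after the third failure, which would be consistent with the container later exiting with a non-zero code.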
nodemanager's log:
2017-02-09 08:30:48,836 INFO
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed
completed containers from NM context: [container_1486564603520_0097_01_000005]
2017-02-09 08:31:12,122 WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code
from container container_1486564603520_0096_01_000071 is : 1
2017-02-09 08:31:12,122 WARN
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception
from container-launch with container ID: container_1486564603520_0096_01_000071
and exit code: 1
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from
container-launch.
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id:
container_1486564603520_0096_01_000071
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace:
ExitCodeException exitCode=1:
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
org.apache.hadoop.util.Shell.run(Shell.java:456)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
java.util.concurrent.FutureTask.run(FutureTask.java:266)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at
java.lang.Thread.run(Thread.java:745)
2017-02-09 08:31:12,122 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch:
Container exited with a non-zero exit code 1
2017-02-09 08:31:12,122 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
Container container_1486564603520_0096_01_000071 transitioned from RUNNING to
EXITED_WITH_FAILURE