[
https://issues.apache.org/jira/browse/SPARK-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484185#comment-15484185
]
cen yuhai edited comment on SPARK-17501 at 9/15/16 9:19 AM:
------------------------------------------------------------
I can hardly reproduce this error, but I may have found the root cause. In
HeartbeatReceiver, an executor is recorded in executorLastSeen, but the
BlockManager is recorded in blockManagerInfo in BlockManagerMasterEndpoint. I
think it should not re-register the BlockManager; just putting the executor
into executorLastSeen should resolve this problem.
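To make the suspected loop concrete, here is a minimal sketch in plain Scala. It is a simplified model and not Spark's actual source: the two collections stand in for executorLastSeen and blockManagerInfo, and the object name and the helpers onHeartbeat, reregisterBlockManager, markExecutorSeen and simulate are hypothetical names introduced only for illustration. It assumes a heartbeat from an executor missing from executorLastSeen is answered with a re-register request, and that re-registering only touches blockManagerInfo.

```scala
import scala.collection.mutable

// Simplified model of the two driver-side registries; not Spark's real code.
object ReRegisterLoopSketch {
  // Stand-in for HeartbeatReceiver.executorLastSeen (executorId -> last heartbeat time).
  val executorLastSeen = mutable.Map.empty[String, Long]
  // Stand-in for BlockManagerMasterEndpoint.blockManagerInfo, reduced to a set of ids.
  val blockManagerInfo = mutable.Set.empty[String]

  // Driver side: returns true when the executor is told to re-register.
  def onHeartbeat(executorId: String, now: Long): Boolean =
    if (executorLastSeen.contains(executorId)) {
      executorLastSeen(executorId) = now
      false
    } else {
      true // unknown executor: "Told to re-register on heartbeat"
    }

  // Executor reaction: only the BlockManager registry is updated.
  def reregisterBlockManager(executorId: String): Unit =
    blockManagerInfo += executorId

  // Hypothetical fix suggested in the comment: also record the executor as seen.
  def markExecutorSeen(executorId: String, now: Long): Unit =
    executorLastSeen(executorId) = now

  // Counts how many of `beats` heartbeats trigger a re-registration.
  def simulate(beats: Int, applyFix: Boolean): Int = {
    executorLastSeen.clear()
    blockManagerInfo.clear()
    var count = 0
    for (t <- 1 to beats) {
      if (onHeartbeat("exec-1", t.toLong)) {
        count += 1
        reregisterBlockManager("exec-1")
        if (applyFix) markExecutorSeen("exec-1", t.toLong)
      }
    }
    count
  }

  def main(args: Array[String]): Unit = {
    println(simulate(7, applyFix = false)) // every heartbeat re-registers
    println(simulate(7, applyFix = true))  // only the first heartbeat re-registers
  }
}
```

In this model, without the fix every heartbeat asks for another re-registration, which matches the repeating pattern in the log; with the executor recorded in executorLastSeen, the loop stops after the first round.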
was (Author: cenyuhai):
I can hardly reproduce this error, but I may have found the root cause. In
HeartbeatReceiver, an executor is recorded in executorLastSeen, but the
BlockManager is recorded in blockManagerInfo in BlockManagerMasterEndpoint. It
should not register the BlockManager; the executor needs to send
RegisterExecutor.
> Re-register BlockManager again and again
> ----------------------------------------
>
> Key: SPARK-17501
> URL: https://issues.apache.org/jira/browse/SPARK-17501
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.2
> Reporter: cen yuhai
> Priority: Minor
>
> After re-registering many times, the executor exits because of a timeout
> exception.
> {code}
> 16/09/11 04:02:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:42 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:02:52 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:02:52 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:02:52 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:02 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:02 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:02 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:12 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:12 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:12 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:22 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:22 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:22 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:32 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:32 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:32 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:42 INFO executor.Executor: Told to re-register on heartbeat
> 16/09/11 04:03:42 INFO storage.BlockManager: BlockManager re-registering with master
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Trying to register BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManagerMaster: Registered BlockManager
> 16/09/11 04:03:42 INFO storage.BlockManager: Reporting 0 blocks to the master.
> 16/09/11 04:03:45 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: spark://[email protected]:22168
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
> at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
> at scala.util.Try$.apply(Try.scala:161)
> at scala.util.Failure.recover(Try.scala:185)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
> at scala.concurrent.Promise$class.complete(Promise.scala:55)
> at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
> at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
> at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
> at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
> at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
> at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
> at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
> at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
> at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> {code}