GitHub user lianhuiwang opened a pull request:
https://github.com/apache/spark/pull/4367
[SPARK-5529][Core]Replace blockManager's timeoutChecking with executor's
timeoutChecking
The symptom:
A BlockManager slave times out and BlockManagerMasterActor removes that
BlockManager, but the executor hosting it does not time out because its akka
heartbeat is still normal.
Because a BlockManager lives inside an executor, when the BlockManager is
removed the executor hosting it should be removed too.
This matters especially when dynamicAllocation is enabled: allocationManager
listens for onBlockManagerRemoved and removes the executor, but in
CoarseGrainedSchedulerBackend the executor is still in executorDataMap.
So I think we can remove the timeout checking from BlockManagerMasterActor
and add executor timeout checking to HeartbeatReceiver.
If an executor times out in HeartbeatReceiver:
First, we tell TaskSchedulerImpl to call executorLost; TaskSchedulerImpl
tells dagScheduler executorLost, and dagScheduler then tells
blockManagerMaster to remove the BlockManager of this executor.
Next, we tell CoarseGrainedSchedulerBackend to kill the timed-out executor
through the SparkContext.killExecutor API.
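The two-step handling above can be sketched in isolation. This is only an
illustration of the idea, not the code in this pull request: HeartbeatTracker,
Scheduler and Backend are hypothetical stand-ins for HeartbeatReceiver,
TaskSchedulerImpl and the SparkContext.killExecutor path, and their signatures
are assumptions.

```scala
import scala.collection.mutable

// Hypothetical stand-ins; these are NOT Spark's real classes or signatures.
trait Scheduler { def executorLost(executorId: String, reason: String): Unit }
trait Backend { def killExecutor(executorId: String): Boolean }

// Tracks the last heartbeat per executor and expires the silent ones.
class HeartbeatTracker(scheduler: Scheduler, backend: Backend, timeoutMs: Long) {
  private val lastSeenMs = mutable.HashMap.empty[String, Long]

  // Record a heartbeat received from an executor at time nowMs.
  def heartbeat(executorId: String, nowMs: Long): Unit = synchronized {
    lastSeenMs(executorId) = nowMs
  }

  // Expire every executor that has been silent for longer than timeoutMs.
  // Returns the expired executor ids in sorted order.
  def expireDeadHosts(nowMs: Long): Seq[String] = synchronized {
    val expired = lastSeenMs.collect {
      case (id, seen) if nowMs - seen > timeoutMs => id
    }.toSeq.sorted
    expired.foreach { id =>
      // Step 1: report the loss so scheduler-side state is cleaned up.
      scheduler.executorLost(id, s"no heartbeat for more than $timeoutMs ms")
      // Step 2: ask the backend to kill the executor itself.
      backend.killExecutor(id)
      lastSeenMs.remove(id)
    }
    expired
  }
}
```

In such a sketch expireDeadHosts would be invoked periodically, at whatever
interval the timeout-interval setting specifies.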
In the future, if we remove akka and implement our own RPC, we only need to
replace akka; the timeout checking in HeartbeatReceiver can be kept for the
other RPC.
Maybe we should change "spark.storage.blockManagerSlaveTimeoutMs" to
"spark.executor.slaveTimeoutMs", "spark.storage.blockManagerTimeoutIntervalMs"
to "spark.executor.timeoutIntervalMs"?
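If the keys were renamed, one way to avoid breaking existing configurations
would be to read the proposed key first and fall back to the current one. The
sketch below is an assumption for illustration, not part of this pull request:
timeoutSetting is a hypothetical helper, a plain Map stands in for the
configuration, and the numeric values are arbitrary placeholders.

```scala
// Hypothetical helper: prefer the proposed key, fall back to the existing
// key, and use the supplied default only when neither is set.
def timeoutSetting(
    conf: Map[String, String],
    newKey: String,
    oldKey: String,
    defaultMs: Long): Long =
  conf.get(newKey).orElse(conf.get(oldKey)).map(_.toLong).getOrElse(defaultMs)

// A configuration that only sets the existing key (placeholder value).
val legacyConf = Map("spark.storage.blockManagerSlaveTimeoutMs" -> "30000")

val slaveTimeoutMs = timeoutSetting(
  legacyConf,
  "spark.executor.slaveTimeoutMs",
  "spark.storage.blockManagerSlaveTimeoutMs",
  defaultMs = 10000L) // placeholder default, not Spark's actual default
```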
@rxin @tdas @sryza @andrewor14
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/lianhuiwang/spark SPARK-5529
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4367.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4367
----
commit aeb74b02a5521185c2cb571388b577a7af4e8da9
Author: lianhuiwang <[email protected]>
Date: 2015-02-04T12:27:33Z
Replace blockManager's timeoutChecking of BlockManagerMasterActor with
executor's timeoutChecking of HeartbeatReceiver
----