Zhang, Liye created SPARK-4991:
----------------------------------
Summary: Worker should reconnect to Master when Master actor
restart
Key: SPARK-4991
URL: https://issues.apache.org/jira/browse/SPARK-4991
Project: Spark
Issue Type: Improvement
Components: Deploy, Spark Core
Affects Versions: 1.2.0, 1.1.0, 1.0.0
Reporter: Zhang, Liye
This is a follow-up to
[SPARK-4989|https://issues.apache.org/jira/browse/SPARK-4989]. When the Master Akka
actor encounters an exception, the Master restarts (an Akka actor restart, not a
JVM restart), and all of its old state is cleared (including workers,
applications, etc.). However, the workers are not aware of this at all. The
resulting state of the cluster is that the master is up and all workers are up,
but the master is not aware that the workers exist and ignores every worker's
heartbeat because none of the workers is registered. As a result, the whole
cluster is unavailable.
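The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the Master and Worker classes, their methods, and the reconnect flag are hypothetical names invented for illustration and do not correspond to Spark's actual deploy code. The re-registration step shows one possible way a worker could recover, not necessarily the fix that will be adopted.

```python
# Toy model of the failure mode. All class, method, and parameter names here
# are hypothetical and do not correspond to Spark's Master/Worker code.


class Master:
    def __init__(self):
        self.workers = {}  # worker_id -> number of accepted heartbeats

    def restart(self):
        # Models an actor restart: the process stays up, but the
        # in-memory registry of workers is lost.
        self.workers = {}

    def register(self, worker_id):
        self.workers[worker_id] = 0

    def heartbeat(self, worker_id):
        # Returns True if the heartbeat is accepted, False if the
        # worker is unknown and the heartbeat is ignored.
        if worker_id not in self.workers:
            return False
        self.workers[worker_id] += 1
        return True


class Worker:
    def __init__(self, worker_id, master, reconnect=False):
        self.worker_id = worker_id
        self.master = master
        self.reconnect = reconnect  # assumed switch for the proposed behavior
        master.register(worker_id)

    def send_heartbeat(self):
        accepted = self.master.heartbeat(self.worker_id)
        if not accepted and self.reconnect:
            # Proposed behavior: re-register once the master no longer
            # recognizes this worker, then retry the heartbeat.
            self.master.register(self.worker_id)
            accepted = self.master.heartbeat(self.worker_id)
        return accepted


def simulate(reconnect):
    master = Master()
    worker = Worker("worker-1", master, reconnect=reconnect)
    before = worker.send_heartbeat()  # heartbeat before the master restarts
    master.restart()
    after = worker.send_heartbeat()   # heartbeat after the master restarts
    return before, after, sorted(master.workers)
```

With reconnect=False, the second heartbeat is rejected and the master's registry stays empty, which matches the unavailable cluster described in this issue. With reconnect=True, the worker re-registers and the heartbeat is accepted again.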
For more background on this area, please refer to
[SPARK-3736|https://issues.apache.org/jira/browse/SPARK-3736] and
[SPARK-4592|https://issues.apache.org/jira/browse/SPARK-4592].