Github user juanrh commented on the issue:
https://github.com/apache/spark/pull/19267
Hi @vanzin, thanks for taking a look.
This was part of a discussion with @holdenk about SPARK-20628. I have
attached the document
[Spark_Blacklisting_on_decommissioning-Scope.pdf](https://github.com/apache/spark/files/1349653/Spark_Blacklisting_on_decommissioning-Scope.pdf)
with our approach. The basic idea is to implement a mechanism similar to YARN's graceful decommission, but for Spark. In the PR we define a `HostState` type to represent the state of the cluster nodes, and take actions in `CoarseGrainedSchedulerBackend.handleUpdatedHostState` when a node transitions into a state where it becomes partially or totally unavailable. Just like YARN and Mesos, we propose a decommission mechanism with two phases: first a drain phase where the node is still running but not accepting further work (DECOMMISSIONING in YARN, DRAIN in Mesos), followed by a second phase where the executors on the node are forcibly shut down (DECOMMISSIONED in YARN, DOWN in Mesos). In this PR we focus only on YARN, and only on the actions taken when a node transitions into the DECOMMISSIONING state: blacklisting the node when it transitions to DECOMMISSIONING, and un-blacklisting it when it returns to the healthy RUNNING state.
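For the sake of discussion, here is a minimal sketch of the shape this could take. The exact definition of `HostState` and of `handleUpdatedHostState` in the PR may differ, and `blacklistNode` / `unblacklistNode` are just placeholders for whatever blacklisting calls the scheduler backend ends up using:

```scala
// Sketch only: illustrates the intent, not the exact code in the PR.
sealed trait HostState
object HostState {
  case object Running         extends HostState // healthy, accepts new work
  case object Decommissioning extends HostState // drain phase: running, no new work
  case object Decommissioned  extends HostState // node shut down / unavailable
}

// Inside CoarseGrainedSchedulerBackend (sketch); blacklistNode and
// unblacklistNode are hypothetical placeholders for the real blacklist calls.
def handleUpdatedHostState(host: String, state: HostState): Unit = state match {
  case HostState.Decommissioning =>
    blacklistNode(host)   // stop scheduling further work on the draining node
  case HostState.Running =>
    unblacklistNode(host) // node is healthy again, make it schedulable
  case HostState.Decommissioned =>
    // out of scope for this PR: executors on the node are forcibly shut down
}
```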
The decommissioning process would not be initiated by Spark, but by an operator or an automated system (e.g. the cloud environment where YARN is running), in response to some relevant event (e.g. a cluster resize event), and it would consist of calling the YARN administrative command `yarn rmadmin -refreshNodes -g` for the affected node. Spark would just react to the node state transition events it receives from the cluster manager.
To make this extensible to other cluster managers besides YARN, we define the `HostState` type in Spark, and keep the interaction with the specifics of each cluster manager in the corresponding packages. For example, for YARN, when `YarnAllocator` gets a node state transition event, it converts the event from the YARN-specific [NodeState](https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/yarn/api/records/NodeState.html) into `HostState`, wraps it in a `HostStatusUpdate` message, and sends it to the `CoarseGrainedSchedulerBackend`, which then performs the required actions for that node.
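As a rough illustration of that YARN-side translation (assuming the `HostState` values sketched above; `toHostState`, the `HostStatusUpdate` case class shape, and the `driverRef` parameter are names assumed here for illustration, while `AllocateResponse.getUpdatedNodes` and `NodeReport` are the actual Hadoop APIs):

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
import org.apache.hadoop.yarn.api.records.NodeState
import org.apache.spark.rpc.RpcEndpointRef

// Message sent from the YARN code to CoarseGrainedSchedulerBackend (sketch).
case class HostStatusUpdate(host: String, state: HostState)

// Map the YARN-specific NodeState to Spark's cluster-manager-agnostic HostState.
// Note: NodeState.DECOMMISSIONING only exists in YARN versions that include
// YARN-4676 (vanilla 2.7.3 does not have it).
def toHostState(state: NodeState): Option[HostState] = state match {
  case NodeState.RUNNING         => Some(HostState.Running)
  case NodeState.DECOMMISSIONING => Some(HostState.Decommissioning)
  case NodeState.DECOMMISSIONED  => Some(HostState.Decommissioned)
  case _                         => None // other states are not handled here
}

// Inside YarnAllocator, after each allocate() heartbeat (sketch); driverRef
// stands for the scheduler backend's RPC endpoint reference.
def handleUpdatedNodes(response: AllocateResponse, driverRef: RpcEndpointRef): Unit = {
  response.getUpdatedNodes.asScala.foreach { report =>
    toHostState(report.getNodeState).foreach { hostState =>
      driverRef.send(HostStatusUpdate(report.getNodeId.getHost, hostState))
    }
  }
}
```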
This code works on a modified version of Hadoop 2.7.3 with patches to support [YARN-4676](https://issues.apache.org/jira/browse/YARN-4676) (basic graceful decommission), and an approximation to [YARN-3224](https://issues.apache.org/jira/browse/YARN-3224) (when a node transitions into the DECOMMISSIONING state, the resource manager notifies each relevant application master by adding the node to the list of updated nodes available in the [AllocateResponse](https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.html#getUpdatedNodes()) returned by the RM in response to the AM heartbeat). For these reasons, this code won't work as-is on vanilla Hadoop. The main problem is that the decommissioning mechanism for YARN is not completely implemented (see [YARN-914](https://issues.apache.org/jira/browse/YARN-914)), and some of the parts that are implemented are only available starting with YARN 2.9.0 (see [YARN-4676](https://issues.apache.org/jira/browse/YARN-4676)).
To cope with this, we propose implementing an administrative command to send node transitions directly to the Spark driver, as `HostStatusUpdate` messages addressed to the `CoarseGrainedSchedulerBackend`. This command would be similar to `yarn rmadmin -refreshNodes -g`, which is currently used for decommissioning nodes in YARN. When YARN-914 is complete, this could still be used as a secondary interface for decommissioning nodes, so node transitions could be signaled either by the cluster manager, or using the administrative command (either manually or through some automation implemented by the cloud environment).
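To make the administrative command part more concrete, here is a very rough sketch of what such a tool might look like. Everything in it is hypothetical at this point (the command name, the argument format, and how the tool obtains a reference to the driver's scheduler backend endpoint); the only fixed point is that it would send the same `HostStatusUpdate` messages as the YARN path above:

```scala
// Hypothetical admin command: signal a node state transition directly to the
// Spark driver. None of these names exist yet; this is only a sketch of the
// proposal, assuming the HostState / HostStatusUpdate types from above.
object NodeStateAdminCommand {

  // Placeholder for the driver's scheduler backend endpoint. A real
  // implementation would obtain this through Spark's internal RPC layer.
  trait DriverEndpoint { def send(message: Any): Unit }
  def connectToDriver(driverAddress: String): DriverEndpoint =
    throw new UnsupportedOperationException("RPC plumbing not sketched here")

  def main(args: Array[String]): Unit = {
    // Hypothetical invocation:
    //   node-state-admin <driver-host:port> <node-host> <RUNNING|DECOMMISSIONING|DECOMMISSIONED>
    val Array(driverAddress, nodeHost, stateName) = args

    val state = stateName match {
      case "RUNNING"         => HostState.Running
      case "DECOMMISSIONING" => HostState.Decommissioning
      case "DECOMMISSIONED"  => HostState.Decommissioned
      case other             => sys.error(s"Unsupported host state: $other")
    }

    // Same message that YarnAllocator sends, so CoarseGrainedSchedulerBackend
    // treats both sources (cluster manager events and admin command) the same.
    connectToDriver(driverAddress).send(HostStatusUpdate(nodeHost, state))
  }
}
```

The key design point is that the scheduler backend only ever sees `HostStatusUpdate` messages, so it does not need to know whether a transition came from the cluster manager or from an operator.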
We would like to get some feedback on this approach in general, and on the administrative command solution in particular. If that sounds good, then we will work on modifying this PR so it works on vanilla Hadoop 2.7, and on implementing the administrative command.