Github user juanrh commented on the issue:

    https://github.com/apache/spark/pull/19267
  
    Hi @vanzin, thanks for taking a look.
    
    This was part of a discussion with @holdenk about SPARK-20628. I have 
attached the document 
[Spark_Blacklisting_on_decommissioning-Scope.pdf](https://github.com/apache/spark/files/1349653/Spark_Blacklisting_on_decommissioning-Scope.pdf)
 with our approach. The basic idea is to implement a mechanism similar to 
YARN's graceful decommission, but for Spark. In the PR we define a `HostState` 
type to represent the state of the cluster nodes, and take actions in 
`CoarseGrainedSchedulerBackend.handleUpdatedHostState` when a node transitions 
into a state where it becomes partially or totally unavailable. Just like 
YARN and Mesos, we propose a decommission mechanism with 2 phases: first a 
drain phase where the node is still running but not accepting further work 
(DECOMMISSIONING in YARN, DRAIN in Mesos), followed by a second phase where 
the executors on the node are forcibly shut down (DECOMMISSIONED in YARN, and 
DOWN in Mesos). In this PR we focus only on YARN, and only on the actions taken 
when a node transitions into the DECOMMISSIONING state: blacklisting the node 
when it transitions to DECOMMISSIONING, and un-blacklisting it when it gets 
back to the normal, healthy RUNNING state.
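    For illustration, a minimal sketch of what such a `HostState` type could 
look like (the concrete case names below are assumptions for illustration, not 
necessarily the exact ones in the PR):

```scala
// Illustrative sketch only: a cluster-manager-agnostic representation of the
// node states involved in the two-phase decommission described above.
sealed trait HostState

object HostState {
  case object Running         extends HostState // healthy, accepting work
  case object Decommissioning extends HostState // draining: still running, but no new work
  case object Decommissioned  extends HostState // second phase: executors forcibly shut down
}
```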
    The decommissioning process would not be initiated by Spark, but by an 
operator or an automated system (e.g. the cloud environment where YARN is 
running), in response to some relevant event (e.g. a cluster resize event), and 
it would consist of calling the YARN administrative command `yarn rmadmin 
-refreshNodes -g` for the affected node. Spark would just react to the node 
state transition events it receives from the cluster manager.
    To make this extensible to other cluster managers besides YARN, we define 
the `HostState` type in Spark, and keep the interaction with the specifics of 
each cluster manager in the corresponding packages. For example, for YARN, 
when `YarnAllocator` gets a node state transition event, it converts the event 
from the YARN-specific 
[NodeState](https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/yarn/api/records/NodeState.html)
 into `HostState`, wraps it into a `HostStatusUpdate` message, and sends it to 
the `CoarseGrainedSchedulerBackend`, which then performs the required actions 
for that node.
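    As a rough sketch of that flow (the helper names and exact signatures below 
are assumptions; only `HostState`, `HostStatusUpdate`, and 
`CoarseGrainedSchedulerBackend.handleUpdatedHostState` are names from the PR):

```scala
import org.apache.hadoop.yarn.api.records.{NodeReport, NodeState}

// Hypothetical translation in YarnAllocator from YARN's NodeState to Spark's
// HostState. NodeState.DECOMMISSIONING only exists in YARN versions (or
// patched builds) that support graceful decommission.
private def toHostState(nodeState: NodeState): Option[HostState] = nodeState match {
  case NodeState.RUNNING         => Some(HostState.Running)
  case NodeState.DECOMMISSIONING => Some(HostState.Decommissioning)
  case NodeState.DECOMMISSIONED  => Some(HostState.Decommissioned)
  case _                         => None // other transitions are not relevant here
}

// Called with the updated nodes reported in the AllocateResponse of an AM
// heartbeat. `driverRef` is the driver RpcEndpointRef that YarnAllocator holds.
private def handleUpdatedNodes(updatedNodes: Seq[NodeReport]): Unit = {
  updatedNodes.foreach { report =>
    toHostState(report.getNodeState).foreach { state =>
      // Wrap the transition in a HostStatusUpdate message and send it to the
      // CoarseGrainedSchedulerBackend on the driver, which reacts to it
      // (e.g. blacklisting or un-blacklisting the host).
      driverRef.send(HostStatusUpdate(report.getNodeId.getHost, state))
    }
  }
}
```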
    
    This code works on a modified version of Hadoop 2.7.3 with patches to 
support [YARN-4676](https://issues.apache.org/jira/browse/YARN-4676) (basic 
graceful decommission), and an approximation to 
[YARN-3224](https://issues.apache.org/jira/browse/YARN-3224) (when a node 
transitions into the DECOMMISSIONING state, the resource manager notifies each 
relevant application master by adding the node to the list of updated nodes 
available in the 
[AllocateResponse](https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/yarn/api/protocolrecords/AllocateResponse.html#getUpdatedNodes())
 returned by the RM in response to the AM heartbeat). For these reasons, this 
code won't work as-is on vanilla Hadoop. The main problem is that the 
decommissioning mechanism for YARN is not completely implemented (see 
[YARN-914](https://issues.apache.org/jira/browse/YARN-914)), and some of the 
parts that are implemented are only available from YARN 2.9.0 (see 
[YARN-4676](https://issues.apache.org/jira/browse/YARN-4676)). To cope with 
this, we propose implementing an administrative 
command to send node transitions directly to the Spark driver, as 
`HostStatusUpdate` messages addressed to the `CoarseGrainedSchedulerBackend`. 
This command would be similar to `yarn rmadmin -refreshNodes -g`, which is 
currently used for decommissioning nodes in YARN. When YARN-914 is complete, 
this could still be used as a secondary interface for decommissioning nodes, so 
node transitions could be signaled either by the cluster manager or via the 
administrative command (either manually or through some automation implemented 
by the cloud environment).
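    Purely as an illustration of what that administrative command could do (the 
tool name and argument handling below are assumptions; since the RPC classes 
are `private[spark]`, such a tool would have to live inside Spark):

```scala
import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.rpc.{RpcAddress, RpcEnv}
import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend

// Hypothetical admin tool: connects to the driver's RPC endpoint and injects a
// HostStatusUpdate message, mimicking what the cluster manager would send on a
// node state transition.
object HostStateAdminCommand {
  def main(args: Array[String]): Unit = {
    val Array(driverHost, driverPort, host, stateName) = args
    val state = stateName match {
      case "RUNNING"         => HostState.Running
      case "DECOMMISSIONING" => HostState.Decommissioning
      case other             => sys.error(s"Unsupported host state: $other")
    }
    val conf = new SparkConf()
    val rpcEnv = RpcEnv.create("hostStateAdmin", "localhost", 0, conf,
      new SecurityManager(conf), clientMode = true)
    // Resolve the CoarseGrainedSchedulerBackend endpoint on the driver and
    // send the host status update to it.
    val driver = rpcEnv.setupEndpointRef(
      RpcAddress(driverHost, driverPort.toInt),
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME)
    driver.send(HostStatusUpdate(host, state))
    rpcEnv.shutdown()
    rpcEnv.awaitTermination()
  }
}
```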
    
    We would like to get some feedback on this approach in general, and on the 
administrative command solution in particular. If that sounds good, we will 
work on modifying this PR so that it works on vanilla Hadoop 2.7, and on 
implementing the administrative command.
    
    


