Ming Ma created HDFS-7521:
-----------------------------
Summary: Refactor DN state management
Key: HDFS-7521
URL: https://issues.apache.org/jira/browse/HDFS-7521
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Ming Ma
There are two aspects w.r.t. DN state management in NN.
* State machine management within active NN
NN maintains the state of each data node, i.e. whether it is running or being
decommissioned. But the state machine isn’t well defined, and we have dealt with
several corner-case bugs in this area. It would be useful to refactor the
code to use a clear state machine definition that defines the events, available
states and actions for state transitions. This has the following benefits.
** Make it easy to define correctness of DN state management. Currently some of
the state transitions aren't defined in the code. For example, if admins remove
a node from the include host file while the node is being decommissioned, it
will be transitioned to DEAD and DECOMM_IN_PROGRESS. That might not be the
intention. If we have a state machine definition, we can identify this case.
** Make it easy to add a new state for DN later. For example, people have
discussed a new “maintenance” state for DN to support the scenario where admins
need to take a machine or rack down for 30 minutes for repair.
We can refactor DN state management around a clear state machine definition
based on the YARN state-related components.
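To make the idea concrete, here is a minimal, self-contained sketch of a
table-driven DN state machine. It does not use the YARN classes, and every
name in it (DnStateMachineSketch, the LIVE and DECOMMISSIONED states, the
event names, the allow/fire methods) is hypothetical and chosen only for
illustration. Only DEAD and DECOMM_IN_PROGRESS are taken from the description
above. The transition set is an assumption, not the current NN behavior.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the HDFS code base.
class DnStateMachineSketch {
    enum State { LIVE, DECOMM_IN_PROGRESS, DECOMMISSIONED, DEAD }

    enum Event {
        START_DECOMMISSION, DECOMMISSION_COMPLETE,
        HEARTBEAT_EXPIRED, REMOVED_FROM_INCLUDE_FILE
    }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<>(State.class);

    private static void allow(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, s -> new EnumMap<>(Event.class))
                   .put(on, to);
    }

    static {
        // Assumed transitions, for illustration only.
        allow(State.LIVE, Event.START_DECOMMISSION, State.DECOMM_IN_PROGRESS);
        allow(State.LIVE, Event.HEARTBEAT_EXPIRED, State.DEAD);
        allow(State.LIVE, Event.REMOVED_FROM_INCLUDE_FILE, State.DEAD);
        allow(State.DECOMM_IN_PROGRESS, Event.DECOMMISSION_COMPLETE,
              State.DECOMMISSIONED);
        allow(State.DECOMM_IN_PROGRESS, Event.HEARTBEAT_EXPIRED, State.DEAD);
        // Deliberately no entry for
        // (DECOMM_IN_PROGRESS, REMOVED_FROM_INCLUDE_FILE): the table makes
        // the missing decision visible instead of leaving it implicit.
    }

    private State state = State.LIVE;

    State getState() {
        return state;
    }

    // Applies an event; rejects any transition the table does not define.
    State fire(Event event) {
        Map<Event, State> row = TRANSITIONS.get(state);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException(
                "No transition defined for " + event + " in state " + state);
        }
        state = next;
        return state;
    }

    public static void main(String[] args) {
        DnStateMachineSketch dn = new DnStateMachineSketch();
        dn.fire(Event.START_DECOMMISSION);
        System.out.println("state: " + dn.getState());
        try {
            dn.fire(Event.REMOVED_FROM_INCLUDE_FILE);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With an explicit table like this, the include-file case described above shows
up as a missing entry that has to be decided one way or the other, and a new
state such as “maintenance” would be added as one enum value plus a few
allow(...) lines.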
* State machine consistency between active and standby NN
Another dimension of state machine management is consistency across NN pairs.
We have dealt with bugs caused by the active NN and standby NN seeing different
sets of live nodes. The current design is to have each NN manage its own state
based on the events it receives. For example, DNs send heartbeats to both NNs;
admins issue decommission commands to both NNs. An alternative design we have
discussed is to have ZK manage the state.
Thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)