[ 
https://issues.apache.org/jira/browse/HDFS-7521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250713#comment-14250713
 ] 

Ming Ma commented on HDFS-7521:
-------------------------------

Folks, thanks for the comments.

[~wheat9], I agree with you that a simpler solution is better. This state machine 
lib has been used in YARN and MR and has proven quite useful for debugging, 
especially when a new state needs to be added. When we fixed corner cases in 
DN state management, we actually wanted to investigate ways to do formal 
checking on NN, but there is no good way to do that without a state machine, as 
you mentioned. I definitely want to hear what others have to say about the 
need for a state machine lib.

[~zhz], the main reason to have two states is to reduce the overall number of 
possible states. For the most part, liveness and admin state are independent. 
The case you mentioned is specified in the diagram: In_Service can transition 
to either Decommission_In_Progress or Decommissioned upon receiving a 
DECOMISSION_REQUESTED event. Yeah, you can't tell from the diagram how that 
decision is made; only the source code has the answer.
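
To make the two-dimension idea concrete, here is a minimal sketch in plain 
Java. It is not the attached patch and does not use the YARN state machine 
lib. The class name, the RECOMMISSION_REQUESTED event, the recommission 
transitions and the chooser callback are all hypothetical; only the 
In_Service transition on DECOMISSION_REQUESTED comes from the diagram. How 
the real code picks between the two targets is deliberately left to the 
caller here.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a DN tracked by two independent state dimensions. */
final class DnStateSketch {
    enum Liveness { LIVE, DEAD }
    enum AdminState { IN_SERVICE, DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }
    // DECOMISSION_REQUESTED is spelled as in the comment; the other event is assumed.
    enum Event { DECOMISSION_REQUESTED, RECOMMISSION_REQUESTED }

    // (current admin state, event) -> set of allowed target states.
    private final Map<AdminState, Map<Event, Set<AdminState>>> table =
        new EnumMap<AdminState, Map<Event, Set<AdminState>>>(AdminState.class);

    Liveness liveness = Liveness.LIVE;
    AdminState admin = AdminState.IN_SERVICE;

    DnStateSketch() {
        // From the diagram: one event, two possible targets.
        allow(AdminState.IN_SERVICE, Event.DECOMISSION_REQUESTED,
            EnumSet.of(AdminState.DECOMMISSION_IN_PROGRESS, AdminState.DECOMMISSIONED));
        // Assumed for illustration only.
        allow(AdminState.DECOMMISSION_IN_PROGRESS, Event.RECOMMISSION_REQUESTED,
            EnumSet.of(AdminState.IN_SERVICE));
        allow(AdminState.DECOMMISSIONED, Event.RECOMMISSION_REQUESTED,
            EnumSet.of(AdminState.IN_SERVICE));
    }

    private void allow(AdminState from, Event on, Set<AdminState> targets) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, Set<AdminState>>(Event.class))
             .put(on, targets);
    }

    /** Liveness changes never touch the admin dimension. */
    void markDead() { liveness = Liveness.DEAD; }

    /**
     * Applies an admin event. When several targets are allowed, the chooser
     * picks one; undefined (state, event) pairs are rejected instead of
     * silently producing an unintended combination.
     */
    AdminState handle(Event event, Function<Set<AdminState>, AdminState> chooser) {
        Map<Event, Set<AdminState>> byEvent = table.get(admin);
        Set<AdminState> targets = (byEvent == null) ? null : byEvent.get(event);
        if (targets == null) {
            throw new IllegalStateException("Undefined transition: " + admin + " on " + event);
        }
        AdminState next = (targets.size() == 1)
            ? targets.iterator().next()
            : chooser.apply(targets);
        if (!targets.contains(next)) {
            throw new IllegalStateException("Chooser returned disallowed state: " + next);
        }
        admin = next;
        return next;
    }
}
```

With a table like this, an event that has no entry for the current state 
fails loudly, which is the kind of undefined transition the issue 
description wants to be able to identify.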


> Refactor DN state management
> ----------------------------
>
>                 Key: HDFS-7521
>                 URL: https://issues.apache.org/jira/browse/HDFS-7521
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>         Attachments: DNStateMachines.png, HDFS-7521.patch
>
>
> There are two aspects w.r.t. DN state management in NN.
> * State machine management within active NN
> NN maintains the state of each data node regarding whether it is running or 
> being decommissioned. But the state machine isn’t well defined. We have dealt 
> with some corner-case bugs in this area. It would be useful to refactor 
> the code to use a clear state machine definition that defines events, available 
> states and actions for state transitions. This has the following benefits.
> ** Make it easy to define correctness of DN state management. Currently some 
> of the state transitions aren't defined in the code. For example, if admins 
> remove a node from the include host file while the node is being decommissioned, 
> it will be transitioned to DEAD and DECOMM_IN_PROGRESS. That might not be the 
> intention. If we have a state machine definition, we can identify this case.
> ** Make it easy to add a new state for DN later. For example, people have 
> discussed a new “maintenance” state for DN to support the scenario where admins 
> need to take the machine/rack down for 30 minutes for repair.
> We can refactor DN with a clear state machine definition based on the YARN 
> state-related components.
> * State machine consistency between active and standby NN
> Another dimension of state machine management is consistency across NN pairs. 
> We have dealt with bugs caused by the active NN and the standby NN seeing 
> different sets of live nodes. The current design is to have each NN manage its 
> own state based on the events it receives. For example, DNs send heartbeats to 
> both NNs; admins issue decommission commands to both NNs. An alternative design 
> approach could be to have ZK manage the state.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
