[ 
https://issues.apache.org/jira/browse/HDDS-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDDS-2592:
------------------------------------
    Description: 
When the operational state of a datanode changes, an async command should be 
triggered to persist the new state on that datanode. For maintenance mode, the 
datanode should also store the maintenance end time. The datanode will then 
report the new state (and optional maintenance end time) back via its heartbeat.
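
As a rough illustration of the persist-and-report flow described above, here 
is a minimal Python sketch. Everything in it is an assumption made for 
illustration: the class and field names, the state strings, and the use of a 
JSON file are hypothetical and are not the Ozone datanode implementation.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional


@dataclass
class PersistedOpState:
    """Hypothetical on-disk record; field names are illustrative only."""
    operational_state: str = "IN_SERVICE"
    # Only meaningful for maintenance mode; None otherwise.
    maintenance_end_epoch_secs: Optional[int] = None


class DatanodeSketch:
    """Hypothetical datanode that persists the state SCM asks it to set."""

    def __init__(self, state_file: Path) -> None:
        self.state_file = state_file

    def handle_set_state_command(
        self, state: str, end_epoch_secs: Optional[int] = None
    ) -> None:
        # Write to a temporary file and rename it, so a crash mid-write
        # does not leave a truncated state file behind.
        record = PersistedOpState(state, end_epoch_secs)
        tmp = self.state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(asdict(record)))
        tmp.replace(self.state_file)

    def heartbeat_payload(self) -> dict:
        # Report whatever is persisted, or the default if nothing is.
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return asdict(PersistedOpState())
```

Because the heartbeat payload is read back from disk, a datanode that 
restarts keeps reporting the last state it was told to persist.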

The purpose of the DN persisting this information and heartbeating it back to 
SCM is to allow the operational state to be recovered after an SCM restart, as 
SCM does not persist any of this information. It also allows "Recon" to learn 
the datanode states.

If SCM is restarted, then it will forget all knowledge of the datanodes. When 
they register, their operational state will be reported and SCM can set it 
correctly.

Outside of registration (i.e., during normal heartbeats), the SCM state is the 
source of truth for the operational state. If the DN heartbeat reports a state 
that differs from the SCM state, SCM should issue another command to the 
datanode to set its state to the SCM value. There is a chance the mismatch is 
due to a not-yet-processed command triggered by the SCM state change, but the 
worst case is an extra command sent to the datanode. This is a very 
lightweight command, so that is not an issue.
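
The registration and heartbeat rules above can be sketched roughly as follows. 
This is a hypothetical Python model, not SCM code: the class, method names and 
state strings are assumptions, and commands are simply appended to a list 
rather than sent anywhere.

```python
from typing import Dict, List, Tuple


class ScmSketch:
    """Hypothetical in-memory model of SCM's view of datanode states."""

    def __init__(self) -> None:
        # Held in memory only, so it is lost when SCM restarts.
        self.node_states: Dict[str, str] = {}
        # Commands "sent" to datanodes: (datanode id, state to persist).
        self.sent_commands: List[Tuple[str, str]] = []

    def on_register(self, dn_id: str, reported_state: str) -> None:
        # At registration SCM has no record of the node, so it adopts
        # the state the datanode reports.
        self.node_states[dn_id] = reported_state

    def set_state(self, dn_id: str, new_state: str) -> None:
        # An admin-driven change: update SCM, then tell the datanode.
        self.node_states[dn_id] = new_state
        self.sent_commands.append((dn_id, new_state))

    def on_heartbeat(self, dn_id: str, reported_state: str) -> None:
        # Assumes the node has already registered. SCM is the source of
        # truth here, so on a mismatch it re-issues its own value.
        scm_state = self.node_states[dn_id]
        if reported_state != scm_state:
            self.sent_commands.append((dn_id, scm_state))
```

If a heartbeat arrives before the datanode has processed the first command, 
the sketch queues a second, identical command, which is the harmless extra 
command mentioned above.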

One open question is whether to persist intermediate states on the DN. For 
example, when decommissioning, the DN would first persist "Decommissioning" and 
then transition to "Decommissioned" when SCM is satisfied all containers are 
replicated. It would be quite easy to persist both of these states in turn on 
the datanode. Alternatively, we set only the end state (Decommissioned) on the 
datanode and allow SCM to get the node to that state. For the latter, if SCM is 
restarted, then the DN will report "Decommissioned" on registration, but SCM 
will set its internal state to Decommissioning and then ensure all containers 
are replicated before transitioning the node to Decommissioned. This seems like 
a safer approach, but there are also advantages to tracking the intermediate 
states on the DNs.
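
To make the second option concrete (only the end state is stored on the 
datanode), here is a small hypothetical Python sketch of how SCM might treat a 
reported "Decommissioned" state after a restart. The function names and state 
strings are assumptions for illustration and do not come from the Ozone code 
base.

```python
def state_on_registration(reported_state: str) -> str:
    """Map the state a datanode reports at registration to SCM's state.

    A reported end state is not trusted as final: SCM starts from the
    in-progress state and re-checks replication itself.
    """
    if reported_state == "DECOMMISSIONED":
        return "DECOMMISSIONING"
    return reported_state


def advance(scm_state: str, all_containers_replicated: bool) -> str:
    """Move to the end state only once SCM is satisfied with replication."""
    if scm_state == "DECOMMISSIONING" and all_containers_replicated:
        return "DECOMMISSIONED"
    return scm_state
```

With this approach a restarted SCM always re-verifies replication before it 
treats a node as Decommissioned, which is why it looks like the safer of the 
two options.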

  was:
When a node is decommissioned or put into maintenance, SCM will receive the 
command to kick off the workflow. As part of that workflow, it should issue a 
further command to the datanode to set the datanode as either:

maintenance
decommissioned
in_service (this is the default state)

This state should be persisted in the datanode yaml file so it survives reboots.

Upon receiving this command, the datanode will return a new state for all its 
containers in the next container report.

For all closed containers it should return a state of DECOMMISSIONED or 
MAINTENANCE accordingly, while non-closed containers should return their 
original value until they are closed. That way SCM can monitor for unclosed 
containers as part of the decommission flow.

I don't believe there is any need for the datanode to have multiple states for 
each admin state (eg decommissioning + decommissioned / entering_maintenance + 
in_maintenance) as those are only really relevant to SCM. Instead it should be 
enough to set the datanode state once and assume SCM will cause it to 
eventually reach that state. 

These states will be added via HDDS-2459 to progress the changes in the 
Replication Manager on the SCM side:

{code}
ContainerReplicaProto.State.DECOMMISSIONED
ContainerReplicaProto.State.MAINTENANCE
{code}


> Add Datanode command to allow the datanode to persist its admin state 
> ----------------------------------------------------------------------
>
>                 Key: HDDS-2592
>                 URL: https://issues.apache.org/jira/browse/HDDS-2592
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: Ozone Datanode, SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)