[jira] [Assigned] (HDDS-6634) Incorrect datanode state drop after lose connection to SCM server

Krishna Kumar Asawa (Jira) Sun, 20 Aug 2023 22:27:47 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Krishna Kumar Asawa reassigned HDDS-6634:
-----------------------------------------

    Assignee: Sadanand Shenoy

> Incorrect datanode state drop after lose connection to SCM server 
> ------------------------------------------------------------------
>
>                 Key: HDDS-6634
>                 URL: https://issues.apache.org/jira/browse/HDDS-6634
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Mikhail Pochatkin
>            Assignee: Sadanand Shenoy
>            Priority: Major
>
> Currently, when a datanode receives a heartbeat response from the SCM server, 
> *org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask#processResponse*
>  handles a re-registration command, which covers the case where the SCM did 
> not recognize the datanode.
> {code:java}
> private void processResponse(SCMHeartbeatResponseProto response,
>     final DatanodeDetailsProto datanodeDetails) {
>   ...
>     case reregisterCommand:
>       if (rpcEndpoint.getState() == EndPointStates.HEARTBEAT) {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Received SCM notification to register."
>               + " Interrupt HEARTBEAT and transit to REGISTER state.");
>         }
>         rpcEndpoint.setState(EndPointStates.REGISTER);
>       } 
> ...{code}
> Suppose the cluster has been forcibly re-bootstrapped on the existing SCM 
> server host, but one datanode in this cluster has not been re-bootstrapped. 
> The new SCM server then has a different cluster ID. In this case the datanode 
> registers again with the SCM server and continues to push new reports with 
> container information through 
> *org.apache.hadoop.ozone.container.common.report.ReportManager*. Before 
> processing datanode reports, the SCM server does check that containers with 
> the IDs from the report exist, but this raises two questions:
>  # Can container IDs in the new cluster intersect with those in the old 
> one?
>  # Can this lead to data corruption?
> My suggestion is to change the target state in the re-register case to 
> EndPointStates.GETVERSION. This forces the datanode to verify that its 
> cluster ID matches the SCM's before it starts registration.
> {code:java}
> private void processResponse(SCMHeartbeatResponseProto response,
>     final DatanodeDetailsProto datanodeDetails) {
>   ...
>     case reregisterCommand:
>       if (rpcEndpoint.getState() == EndPointStates.HEARTBEAT) {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Received SCM notification to register."
>               + " Interrupt HEARTBEAT and transit to GETVERSION state.");
>         }
>         rpcEndpoint.setState(EndPointStates.GETVERSION);
>       }
> ...{code}
>  
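
The proposed transition can be sketched as a small, self-contained state machine. This is only an illustration under stated assumptions, not the Ozone implementation: `EndpointSketch`, `onReregisterCommand`, `onVersionResponse`, and the cluster ID strings are hypothetical names, and moving to `SHUTDOWN` on a cluster ID mismatch is an assumption about what the version check would do.

```java
/**
 * Hypothetical, simplified model of a datanode endpoint state machine.
 * None of these names are the real Ozone classes.
 */
final class EndpointSketch {
  enum State { GETVERSION, REGISTER, HEARTBEAT, SHUTDOWN }

  private final String localClusterId; // cluster ID the datanode has stored
  private State state = State.HEARTBEAT;

  EndpointSketch(String localClusterId) {
    this.localClusterId = localClusterId;
  }

  State state() {
    return state;
  }

  /** Proposed handling of the re-register command: restart from GETVERSION. */
  void onReregisterCommand() {
    if (state == State.HEARTBEAT) {
      state = State.GETVERSION;
    }
  }

  /**
   * GETVERSION step: compare cluster IDs before registering.
   * Moving to SHUTDOWN on a mismatch is an assumption of this sketch.
   */
  void onVersionResponse(String scmClusterId) {
    if (state != State.GETVERSION) {
      return; // ignore version responses in any other state
    }
    state = localClusterId.equals(scmClusterId) ? State.REGISTER : State.SHUTDOWN;
  }
}
```

With this ordering, a datanode that still holds the old cluster ID never reaches `REGISTER` against a re-bootstrapped SCM, so it has no opportunity to send container reports to the new cluster.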



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
