[
https://issues.apache.org/jira/browse/HDDS-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Krishna Kumar Asawa reassigned HDDS-6634:
-----------------------------------------
Assignee: Sadanand Shenoy
> Incorrect datanode state drop after lose connection to SCM server
> ------------------------------------------------------------------
>
> Key: HDDS-6634
> URL: https://issues.apache.org/jira/browse/HDDS-6634
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Mikhail Pochatkin
> Assignee: Sadanand Shenoy
> Priority: Major
>
> Currently, when the datanode receives a heartbeat response from the SCM server,
> *org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask#processResponse*
> handles the case where SCM does not recognize the datanode by triggering
> re-registration.
> {code:java}
> private void processResponse(SCMHeartbeatResponseProto response,
>     final DatanodeDetailsProto datanodeDetails) {
>   ...
>   case reregisterCommand:
>     if (rpcEndpoint.getState() == EndPointStates.HEARTBEAT) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Received SCM notification to register."
>             + " Interrupt HEARTBEAT and transit to REGISTER state.");
>       }
>       rpcEndpoint.setState(EndPointStates.REGISTER);
>     }
>   ...{code}
> Suppose the cluster is forcibly re-bootstrapped on the existing SCM server
> host while one datanode in the cluster is not re-bootstrapped. The new SCM
> server then has a different cluster ID. In this case the datanode registers
> again with the SCM server and continues to push new reports with container
> information through
> *org.apache.hadoop.ozone.container.common.report.ReportManager*. Before
> processing datanode reports, the SCM server does check that containers with
> the IDs from the report exist, but this raises two questions:
> # Could container IDs overlap between the new cluster and the old one?
> # Could this lead to data corruption?
> My suggestion is to change the target state in the re-register case to
> EndPointStates.GETVERSION. This forces the datanode to verify that its
> cluster ID matches the one SCM has before it starts registration.
> {code:java}
> private void processResponse(SCMHeartbeatResponseProto response,
>     final DatanodeDetailsProto datanodeDetails) {
>   ...
>   case reregisterCommand:
>     if (rpcEndpoint.getState() == EndPointStates.HEARTBEAT) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Received SCM notification to register."
>             + " Interrupt HEARTBEAT and transit to GETVERSION state.");
>       }
>       rpcEndpoint.setState(EndPointStates.GETVERSION);
>     }
>   ...{code}
>
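The proposed transition can be illustrated with a small standalone sketch. It is not Ozone code: the class ReregisterSketch, the State enum, and both methods are hypothetical names, and the choice to move to SHUTDOWN on a cluster ID mismatch is an assumption made only for illustration. The point it shows is that routing the re-register command through GETVERSION puts a cluster ID comparison in front of REGISTER.

```java
import java.util.Objects;

public class ReregisterSketch {

    // Hypothetical, simplified set of endpoint states (not the real EndPointStates enum).
    enum State { GETVERSION, REGISTER, HEARTBEAT, SHUTDOWN }

    // Proposed handling of the re-register command: only an endpoint that is
    // currently heartbeating is sent back to GETVERSION; other states are unchanged.
    static State onReregister(State current) {
        return current == State.HEARTBEAT ? State.GETVERSION : current;
    }

    // Hypothetical version check performed in GETVERSION.
    // localClusterId is null for a datanode that has never joined a cluster.
    // Assumption: a mismatch stops the endpoint instead of registering.
    static State onVersionResponse(String localClusterId, String scmClusterId) {
        Objects.requireNonNull(scmClusterId, "scmClusterId");
        if (localClusterId == null || localClusterId.equals(scmClusterId)) {
            return State.REGISTER;
        }
        return State.SHUTDOWN;
    }

    public static void main(String[] args) {
        // Datanode formatted for cluster "CID-old"; SCM was re-bootstrapped as "CID-new".
        State afterCommand = onReregister(State.HEARTBEAT);
        State afterVersion = onVersionResponse("CID-old", "CID-new");
        System.out.println(afterCommand + " -> " + afterVersion);
    }
}
```

With the current behaviour (HEARTBEAT directly to REGISTER) the comparison in onVersionResponse would never run for a datanode that is already heartbeating, which is the gap described above.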