avijayanhwx commented on pull request #2286:
URL: https://github.com/apache/ozone/pull/2286#issuecomment-852434597
> I think there is another scenario this change won't catch.
>
> 1. Node is put to maintenance with a timeout value.
> 2. Node goes to IN_MAINTENANCE and is then stopped and goes dead.
> 3. Node does not come back in time, and SCM automatically expires the
maintenance, moving the host to IN_SERVICE + DEAD.
>
> In this case the dead node handler would fire once on Recon, and populate
the dead nodes array, then the state would be checked and the array cleared.
The later transition to IN_SERVICE + DEAD would never get seen by Recon.
>
> I wonder if it would be simpler to forget about the dead node handler on
Recon, and just sync up all nodes on SCM with the nodes on Recon on each run
(one every 5 minutes I think). Even with a few 1000 nodes, this should not have
a big overhead.
>
> In saying that, using the dead node handler is a nice optimisation.
>
> We could ask SCM for a list of nodes which are only in the DEAD state and
sync up that group. There is an existing API for that in SCMNodeManager:
>
> ```
> /**
> * Returns all datanode that are in the given states. Passing null for
one of
> * of the states acts like a wildcard for that state. This function
works by
> * taking a snapshot of the current collection and then returning the
list
> * from that collection. This means that real map might have changed by
the
> * time we return this list.
> *
> * @param opState The operational state of the node
> * @param health The health of the node
> * @return List of Datanodes that are known to SCM in the requested
states.
> */
> @Override
> public List<DatanodeDetails> getNodes(
> NodeOperationalState opState, NodeState health) {
> return nodeStateManager.getNodes(opState, health)
> .stream()
> .map(node -> (DatanodeDetails)node).collect(Collectors.toList());
> }
> ```
>
> I think that would cover both scenarios.
Thanks for the review @sodonnel. The current patch does a general sync
through the
[PipelineSyncTask](https://github.com/apache/ozone/pull/2286/files#diff-b4ccedc2f618c5a04940774373e8cdb6f04742c43deaec941d5cd0b0f4589ad4R90)
as well. Doesn't that handle that case? To be fair, the DeadNodeHandler logic
that I have added is redundant here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]