Mark Gui created HDDS-5514:
------------------------------

             Summary: Consider adding a flag that loosens the condition for 
finalizing a datanode.
                 Key: HDDS-5514
                 URL: https://issues.apache.org/jira/browse/HDDS-5514
             Project: Apache Ozone
          Issue Type: Sub-task
            Reporter: Mark Gui


Here is a log that we got from a non-rolling upgrade:

local/master(0766d2cd23afb29f0eb42cf95b09d3d2984c14fa) -> 
upstream/master(57d42b12d3b6451e2ac8519780e82993ecce3611)
{code:java}
2021-07-27 20:49:48,491 [Command processor thread] INFO 
org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Finalization started.
2021-07-27 20:49:48,502 [Command processor thread] WARN 
org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: FinalizeUpgrade : Waiting for 
container to close, current state is: UNHEALTHY
2021-07-27 20:49:48,503 [Command processor thread] INFO 
org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Pre Finalization checks 
failed on the DataNode.
2021-07-27 20:49:48,503 [Command processor thread] WARN 
org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor: Upgrade 
Finalization failed with following Exception. 
PREFINALIZE_VALIDATION_FAILED org.apache.hadoop.ozone.upgrade.UpgradeException: 
Pre Finalization checks failed on the DataNode.
        at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:55)
        at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:39)
        at org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor.execute(DefaultUpgradeFinalizationExecutor.java:48)
        at org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer.finalize(BasicUpgradeFinalizer.java:75)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.finalizeUpgrade(DatanodeStateMachine.java:622)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler.handle(FinalizeNewLayoutVersionCommandHandler.java:78)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:551)
        at java.lang.Thread.run(Thread.java:748)
2021-07-27 20:49:48,503 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler:
 Processing FinalizeNewLayoutVersionCommandHandler command.
2021-07-27 20:49:48,503 [Command processor thread] INFO 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler:
 Finalize Upgrade called!
{code}
Finalization on a datanode fails if any container is in the OPEN, CLOSING, or 
UNHEALTHY state:
{code:java}
// DataNodeUpgradeFinalizer.java
private boolean canFinalizeDataNode(DatanodeStateMachine dsm) {
  // Lets be sure that we do not have any open container before we return
  // from here. This function should be called in its own finalizer thread
  // context.
  Iterator<Container<?>> containerIt =
      dsm.getContainer().getController().getContainers();
  while (containerIt.hasNext()) {
    Container ctr = containerIt.next();
    ContainerProtos.ContainerDataProto.State state = ctr.getContainerState();
    switch (state) {
    case OPEN:
    case CLOSING:
    case UNHEALTHY:
      LOG.warn("FinalizeUpgrade : Waiting for container to close, current "
          + "state is: {}", state);
      return false;
    default:
      continue;
    }
  }
  return true;
}
{code}
In practice, though, a cluster may hold a good many UNHEALTHY containers; that 
is at least the case in our deployment of about 400,000 containers.

Not every layout feature needs all containers to be out of the UNHEALTHY 
state. Examples are SCM_HA and potential features such as merging RocksDB 
instances on the datanode, which do not touch the container layout at all.

We may also want to complete the non-rolling upgrade first and fix the 
UNHEALTHY containers afterwards. The replication manager may eventually handle 
them, but that can take a long time.

So I suggest adding a flag that makes it possible to turn off the check for 
UNHEALTHY containers.
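
To make the proposal concrete, here is a rough, self-contained sketch of how such a flag could gate the check. It is not the actual Ozone code: the class FinalizeCheckSketch, the simplified State enum, and the skipUnhealthyCheck parameter are all hypothetical names used for illustration only, and how the flag would be configured is left open.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Ozone.
class FinalizeCheckSketch {

  // Simplified stand-in for ContainerProtos.ContainerDataProto.State.
  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }

  // Returns true if no container state blocks finalization.
  // skipUnhealthyCheck is the proposed (hypothetical) flag: when set,
  // UNHEALTHY containers no longer block finalization.
  static boolean canFinalize(Iterable<State> states,
      boolean skipUnhealthyCheck) {
    for (State state : states) {
      switch (state) {
      case OPEN:
      case CLOSING:
        // Open or closing containers always block finalization.
        return false;
      case UNHEALTHY:
        if (!skipUnhealthyCheck) {
          return false;
        }
        break;
      default:
        break;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<State> states = Arrays.asList(State.CLOSED, State.UNHEALTHY);
    System.out.println(canFinalize(states, false)); // prints "false"
    System.out.println(canFinalize(states, true));  // prints "true"
  }
}
```

With the flag off, the behavior matches the current check. With it on, only OPEN and CLOSING containers would block finalization.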


