Mark Gui created HDDS-5514:
------------------------------
Summary: Consider adding a flag that loosens the finalization
condition for datanodes.
Key: HDDS-5514
URL: https://issues.apache.org/jira/browse/HDDS-5514
Project: Apache Ozone
Issue Type: Sub-task
Reporter: Mark Gui
Here is a log that we got from a non-rolling upgrade:
local/master(0766d2cd23afb29f0eb42cf95b09d3d2984c14fa) ->
upstream/master(57d42b12d3b6451e2ac8519780e82993ecce3611)
{code:java}
2021-07-27 20:49:48,491 [Command processor thread] INFO org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Finalization started.
2021-07-27 20:49:48,502 [Command processor thread] WARN org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: FinalizeUpgrade : Waiting for container to close, current state is: UNHEALTHY
2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Pre Finalization checks failed on the DataNode.
2021-07-27 20:49:48,503 [Command processor thread] WARN org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor: Upgrade Finalization failed with following Exception.
PREFINALIZE_VALIDATION_FAILED org.apache.hadoop.ozone.upgrade.UpgradeException: Pre Finalization checks failed on the DataNode.
    at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:55)
    at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:39)
    at org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor.execute(DefaultUpgradeFinalizationExecutor.java:48)
    at org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer.finalize(BasicUpgradeFinalizer.java:75)
    at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.finalizeUpgrade(DatanodeStateMachine.java:622)
    at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler.handle(FinalizeNewLayoutVersionCommandHandler.java:78)
    at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
    at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:551)
    at java.lang.Thread.run(Thread.java:748)
2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler: Processing FinalizeNewLayoutVersionCommandHandler command.
2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler: Finalize Upgrade called!
{code}
Finalization on the datanode checks whether any containers are in a non-closed state:
{code:java}
// DataNodeUpgradeFinalizer.java
private boolean canFinalizeDataNode(DatanodeStateMachine dsm) {
  // Lets be sure that we do not have any open container before we return
  // from here. This function should be called in its own finalizer thread
  // context.
  Iterator<Container<?>> containerIt =
      dsm.getContainer().getController().getContainers();
  while (containerIt.hasNext()) {
    Container ctr = containerIt.next();
    ContainerProtos.ContainerDataProto.State state = ctr.getContainerState();
    switch (state) {
    case OPEN:
    case CLOSING:
    case UNHEALTHY:
      LOG.warn("FinalizeUpgrade : Waiting for container to close, current "
          + "state is: {}", state);
      return false;
    default:
      continue;
    }
  }
  return true;
}
{code}
In practice, however, there may be many containers in the UNHEALTHY state, at
least in our deployment, which has about 400000 containers.
Not all layout features require every container to be out of the UNHEALTHY
state. Examples are SCM_HA and potential features such as merging RocksDB
instances on the datanode, which do not touch the container layout at all.
We may also want to complete the non-rolling upgrade first and fix the
UNHEALTHY containers afterwards; the replication manager may handle them
eventually, but that takes a long time.
So I suggest adding a flag that makes it possible to turn off the check for
UNHEALTHY containers.
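
A rough sketch of how such a flag could gate the check is below. It is only an illustration: the class FinalizeCheckSketch, the simplified State enum (a stand-in for ContainerProtos.ContainerDataProto.State), and the allowUnhealthy parameter are all hypothetical names, and how the flag would actually be configured on the datanode is left open.

```java
import java.util.Arrays;
import java.util.List;

public class FinalizeCheckSketch {

  // Simplified, hypothetical stand-in for ContainerProtos.ContainerDataProto.State.
  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }

  /**
   * Returns true when no container blocks finalization.
   *
   * @param states         states of the containers on this datanode
   * @param allowUnhealthy hypothetical flag; when true, UNHEALTHY containers
   *                       no longer block finalization
   */
  static boolean canFinalize(List<State> states, boolean allowUnhealthy) {
    for (State state : states) {
      switch (state) {
        case OPEN:
        case CLOSING:
          // Containers that are still open or closing always block.
          return false;
        case UNHEALTHY:
          // UNHEALTHY blocks only while the relaxed mode is off.
          if (!allowUnhealthy) {
            return false;
          }
          break;
        default:
          break;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<State> states = Arrays.asList(State.CLOSED, State.UNHEALTHY);
    System.out.println(canFinalize(states, false)); // prints "false"
    System.out.println(canFinalize(states, true));  // prints "true"
  }
}
```

With the flag off, the behaviour matches the current check; with it on, only OPEN and CLOSING containers hold back finalization, so the UNHEALTHY ones can be repaired after the upgrade.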