[
https://issues.apache.org/jira/browse/HDDS-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389203#comment-17389203
]
Ethan Rose commented on HDDS-5514:
----------------------------------
Totally agree. I was actually planning on removing this restriction in
HDDS-5432 (still WIP with corner cases to cover) but a separate Jira is good.
IMO a flag is not necessary. Currently none of the layout features reformat
existing containers. I think the plan in the next gen upgrade framework was to
have finalized datanodes be responsible for handling finalization of volumes or
containers that may have been out during the upgrade. I think a simple PR to
just remove UNHEALTHY from the case statement to unblock current upgrades from
master is good.
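
As a rough illustration of that change, here is a minimal self-contained sketch. The `State` enum and the `canFinalize` method are simplified stand-ins assumed for illustration, not the real `ContainerProtos` types or the actual `DataNodeUpgradeFinalizer` method. The only point is that UNHEALTHY no longer blocks finalization, while OPEN and CLOSING still do.

```java
import java.util.Arrays;
import java.util.List;

public class FinalizeCheckSketch {
  // Hypothetical stand-in for ContainerProtos.ContainerDataProto.State.
  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }

  // Returns false only while some container is still OPEN or CLOSING.
  // UNHEALTHY is deliberately absent from the case list.
  static boolean canFinalize(List<State> containerStates) {
    for (State state : containerStates) {
      switch (state) {
      case OPEN:
      case CLOSING:
        return false;
      default:
        break;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // An UNHEALTHY container no longer blocks finalization: prints true.
    System.out.println(
        canFinalize(Arrays.asList(State.CLOSED, State.UNHEALTHY)));
    // An OPEN container still blocks it: prints false.
    System.out.println(
        canFinalize(Arrays.asList(State.CLOSED, State.OPEN)));
  }
}
```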
> Consider adding a flag that loosens the condition for finalizing a datanode.
> ----------------------------------------------------------------------------
>
> Key: HDDS-5514
> URL: https://issues.apache.org/jira/browse/HDDS-5514
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Mark Gui
> Priority: Major
>
> Here is a log that we got from a non-rolling upgrade:
> local/master(0766d2cd23afb29f0eb42cf95b09d3d2984c14fa) ->
> upstream/master(57d42b12d3b6451e2ac8519780e82993ecce3611)
> {code:java}
> 2021-07-27 20:49:48,491 [Command processor thread] INFO org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Finalization started.
> 2021-07-27 20:49:48,502 [Command processor thread] WARN org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: FinalizeUpgrade : Waiting for container to close, current state is: UNHEALTHY
> 2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.upgrade.UpgradeFinalizer: Pre Finalization checks failed on the DataNode.
> 2021-07-27 20:49:48,503 [Command processor thread] WARN org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor: Upgrade Finalization failed with following Exception.
> PREFINALIZE_VALIDATION_FAILED
> org.apache.hadoop.ozone.upgrade.UpgradeException: Pre Finalization checks failed on the DataNode.
>     at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:55)
>     at org.apache.hadoop.ozone.container.upgrade.DataNodeUpgradeFinalizer.preFinalizeUpgrade(DataNodeUpgradeFinalizer.java:39)
>     at org.apache.hadoop.ozone.upgrade.DefaultUpgradeFinalizationExecutor.execute(DefaultUpgradeFinalizationExecutor.java:48)
>     at org.apache.hadoop.ozone.upgrade.BasicUpgradeFinalizer.finalize(BasicUpgradeFinalizer.java:75)
>     at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.finalizeUpgrade(DatanodeStateMachine.java:622)
>     at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler.handle(FinalizeNewLayoutVersionCommandHandler.java:78)
>     at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:99)
>     at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$2(DatanodeStateMachine.java:551)
>     at java.lang.Thread.run(Thread.java:748)
> 2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler: Processing FinalizeNewLayoutVersionCommandHandler command.
> 2021-07-27 20:49:48,503 [Command processor thread] INFO org.apache.hadoop.ozone.container.common.statemachine.commandhandler.FinalizeNewLayoutVersionCommandHandler: Finalize Upgrade called!
> {code}
> Finalization on the datanode checks whether any containers are in a non-closed state:
> {code:java}
> // DataNodeUpgradeFinalizer.java
> private boolean canFinalizeDataNode(DatanodeStateMachine dsm) {
>   // Lets be sure that we do not have any open container before we return
>   // from here. This function should be called in its own finalizer thread
>   // context.
>   Iterator<Container<?>> containerIt =
>       dsm.getContainer().getController().getContainers();
>   while (containerIt.hasNext()) {
>     Container ctr = containerIt.next();
>     ContainerProtos.ContainerDataProto.State state = ctr.getContainerState();
>     switch (state) {
>     case OPEN:
>     case CLOSING:
>     case UNHEALTHY:
>       LOG.warn("FinalizeUpgrade : Waiting for container to close, current "
>           + "state is: {}", state);
>       return false;
>     default:
>       continue;
>     }
>   }
>   return true;
> }
> {code}
> In practice, though, many containers may be in the UNHEALTHY state, at
> least in our deployment of about 400,000 containers.
>
> Not every layout feature requires all containers to be out of the UNHEALTHY
> state. SCM_HA, and potential features such as merging RocksDB instances on
> the datanode, do not touch the container layout at all.
> We may also want to do the non-rolling upgrade first and fix the UNHEALTHY
> containers afterwards. The replication manager may eventually handle them,
> but that can take a long time.
>
> So I suggest adding a flag that makes it possible to turn off the check for
> UNHEALTHY containers.
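
To make the flag suggestion above concrete, a flag-gated check might look roughly like the sketch below. Everything in it is hypothetical: the `State` enum is a simplified stand-in for the real container state type, and the `skipUnhealthyCheck` parameter stands in for a configuration flag whose name and wiring are assumptions, not an existing Ozone setting.

```java
import java.util.Arrays;
import java.util.List;

public class FlaggedFinalizeCheckSketch {
  // Hypothetical stand-in for ContainerProtos.ContainerDataProto.State.
  enum State { OPEN, CLOSING, CLOSED, UNHEALTHY }

  // skipUnhealthyCheck is an assumed flag: when true, UNHEALTHY containers
  // are ignored; when false, they block finalization as in the current code.
  static boolean canFinalize(List<State> states, boolean skipUnhealthyCheck) {
    for (State state : states) {
      switch (state) {
      case OPEN:
      case CLOSING:
        return false;
      case UNHEALTHY:
        if (!skipUnhealthyCheck) {
          return false;
        }
        break;
      default:
        break;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<State> states = Arrays.asList(State.CLOSED, State.UNHEALTHY);
    System.out.println(canFinalize(states, false)); // prints false
    System.out.println(canFinalize(states, true));  // prints true
  }
}
```

With the flag off, the behavior matches the existing check, so the default would not change for clusters that never set it.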