[
https://issues.apache.org/jira/browse/HDDS-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17927333#comment-17927333
]
Ivan Andika commented on HDDS-12150:
------------------------------------
[~smeng] Thank you for filing this.
I think there are other places where RuntimeExceptions are not handled
properly and might silently crash some crucial threads. For example,
exceptions from Preconditions checks or gRPC runtime exceptions might not be
caught. We should look out for similar issues.
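To illustrate the general pattern, here is a minimal sketch of a processing
loop that catches RuntimeException per item, so one bad report cannot stop
the rest. The class and method names (GuardedReportLoop, processAll) are
hypothetical and not taken from the Ozone code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a report-processing loop; not the actual SCM code.
class GuardedReportLoop {

  /**
   * Applies the handler to each report. A RuntimeException thrown for one
   * report is logged and that report is skipped, so the remaining reports
   * are still processed. Returns the reports that failed.
   */
  static <T> List<T> processAll(List<T> reports, Consumer<T> handler) {
    List<T> failed = new ArrayList<>();
    for (T report : reports) {
      try {
        handler.accept(report);
      } catch (RuntimeException e) {
        // Covers IllegalArgumentException from precondition-style checks.
        System.err.println("Skipping report " + report + ": " + e);
        failed.add(report);
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    List<Integer> processed = new ArrayList<>();
    List<Integer> failed = processAll(Arrays.asList(1, 2, 3), id -> {
      if (id == 2) {
        throw new IllegalArgumentException("abnormal state for " + id);
      }
      processed.add(id);
    });
    // prints "processed=[1, 3] failed=[2]"
    System.out.println("processed=" + processed + " failed=" + failed);
  }
}
```

Whether to catch at the per-item level or at the top of the thread's run
loop depends on the handler; this sketch only shows the per-item variant.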
> Abnormal container states should not crash the SCM ContainerReportHandler
> thread
> --------------------------------------------------------------------------------
>
> Key: HDDS-12150
> URL: https://issues.apache.org/jira/browse/HDDS-12150
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Affects Versions: 1.4.1
> Reporter: Siyao Meng
> Assignee: Siyao Meng
> Priority: Critical
>
> We observed a case where a full container report with one abnormal container
> state can crash the SCM leader's ContainerReportHandler thread.
> The reason is that the Preconditions check throws a RuntimeException
> (IllegalArgumentException) that isn't caught and handled properly:
> {code:java|title=https://github.com/apache/ozone/blob/69ba680c515a519a2e2fef611efe151aa033d7cd/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/AbstractContainerReportHandler.java#L339-L340}
>     case QUASI_CLOSED:
>       /*
>        * The container is in QUASI_CLOSED state, this means that at least
>        * one of the replica was QUASI_CLOSED.
>        *
>        * Now replicas can be in any of the following state.
>        *
>        * 1. OPEN
>        * 2. CLOSING
>        * 3. QUASI_CLOSED
>        * 4. CLOSED
>        *
>        * If at least one of the replica is in CLOSED state, mark the
>        * container as CLOSED.
>        *
>        */
>       if (replica.getState() == State.CLOSED) {
>         logger.info("Moving container {} to CLOSED state, datanode {} " +
>             "reported CLOSED replica.", containerId, datanode);
>         Preconditions.checkArgument(replica.getBlockCommitSequenceId()
>             == container.getSequenceId());
>         containerManager.updateContainerState(containerId,
>             LifeCycleEvent.FORCE_CLOSE);
>       }
>       break;
> {code}
> It causes the rest of the container report to be left unprocessed, which
> leads to a huge number of MISSING containers shown in
> {{ozone admin container report}}.
> But those containers are not actually missing. The container DB and blocks
> are still on the datanode volumes/disks. It's just that those container
> reports are not being processed, leading SCM to think they are missing.
> Repro (to be added as a test case):
> 1. SCM has container id 4071867 in QUASI_CLOSED state, bcsId = 208
> 2. A full container report from datanode 1 has a replica of container 4071867
> in {color:red}CLOSED state, bcsId = 0{color}
> 3. Without the patch, container reports after the one above would NOT be
> processed, because the ContainerReportHandler thread crashed due to the
> unhandled exception
> 4. With the patch, a warning would be logged for the abnormal container
> replica, but the other container reports would still be processed.
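The repro steps above can be modelled roughly as follows. This is only a
simplified sketch under stated assumptions, not the actual patch: the types
(QuasiClosedReplicaCheck, Replica, processReport) are hypothetical and are
not Ozone classes, and the second container id is made up for illustration.
It replaces the throwing precondition with an explicit check that records a
warning and moves on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the repro; these are not Ozone classes.
class QuasiClosedReplicaCheck {
  enum State { OPEN, CLOSING, QUASI_CLOSED, CLOSED }

  static final class Replica {
    final long containerId;
    final State state;
    final long bcsId;

    Replica(long containerId, State state, long bcsId) {
      this.containerId = containerId;
      this.state = state;
      this.bcsId = bcsId;
    }
  }

  /**
   * Walks a full report. For a CLOSED replica whose bcsId differs from the
   * sequence id held by SCM (or whose container is unknown), a warning is
   * recorded and the replica is skipped instead of throwing.
   * Returns the ids of the containers that would be moved to CLOSED.
   */
  static List<Long> processReport(List<Replica> report,
      Map<Long, Long> scmSequenceIds, List<String> warnings) {
    List<Long> closed = new ArrayList<>();
    for (Replica r : report) {
      if (r.state != State.CLOSED) {
        continue; // other replica states are out of scope for this sketch
      }
      Long expected = scmSequenceIds.get(r.containerId);
      if (expected == null || expected != r.bcsId) {
        warnings.add("Container " + r.containerId + ": CLOSED replica bcsId="
            + r.bcsId + ", expected " + expected);
        continue;
      }
      closed.add(r.containerId);
    }
    return closed;
  }

  public static void main(String[] args) {
    Map<Long, Long> scm = new HashMap<>();
    scm.put(4071867L, 208L); // QUASI_CLOSED container from the repro
    scm.put(4071868L, 10L);  // hypothetical second container
    List<String> warnings = new ArrayList<>();
    List<Long> closed = processReport(Arrays.asList(
        new Replica(4071867L, State.CLOSED, 0L),
        new Replica(4071868L, State.CLOSED, 10L)), scm, warnings);
    System.out.println("closed=" + closed + " warnings=" + warnings);
  }
}
```

In this model the abnormal replica of container 4071867 produces one warning,
while the replica that follows it in the same report is still handled.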
--
This message was sent by Atlassian Jira
(v8.20.10#820010)