[ https://issues.apache.org/jira/browse/HDDS-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088205#comment-18088205 ]
Attila Doroszlai commented on HDDS-12669:
-----------------------------------------
This turned out to be a real bug:
bq. Currently in ContainerSet.java, recoveringContainerMap records recovering
containers and keys them by their timeout values. However, this
introduces an issue: if two or more containers start recovering at the exact
same time, they will have identical timeout values. Because it's a map, the
newer entry overwrites the older one. As a result, the overwritten container is
silently dropped from the tracking map. If the actual recovery action for this
untracked container gets stuck, the StaleRecoveringContainerScrubbingService will
be unaware of it and cannot trigger the timeout cleanup. Consequently, the
container becomes permanently orphaned and stuck in the 'recovering' state.
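
To illustrate the overwrite described above, here is a minimal, self-contained sketch. It is not the Ozone code and not the actual fix: the class name, method names, and the use of ConcurrentSkipListMap are assumptions made for illustration only. The first method keys the map by timeout alone, so two containers with the same timeout collapse into one entry. The second shows one possible alternative that keeps a set of container IDs per timeout.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical demo class; none of these names come from Ozone.
public class RecoveringMapCollision {

  /** Keys the map by timeout only and returns how many containers remain tracked. */
  public static int trackByTimeoutOnly(long timeoutMs, long... containerIds) {
    Map<Long, Long> byTimeout = new ConcurrentSkipListMap<>();
    for (long id : containerIds) {
      // Same key on every call: each put replaces the previous value.
      byTimeout.put(timeoutMs, id);
    }
    return byTimeout.size();
  }

  /** Keeps a set of container IDs per timeout, so equal timeouts do not collide. */
  public static int trackByTimeoutAndId(long timeoutMs, long... containerIds) {
    Map<Long, Set<Long>> byTimeout = new ConcurrentSkipListMap<>();
    for (long id : containerIds) {
      byTimeout.computeIfAbsent(timeoutMs, t -> new ConcurrentSkipListSet<>()).add(id);
    }
    int tracked = 0;
    for (Set<Long> ids : byTimeout.values()) {
      tracked += ids.size();
    }
    return tracked;
  }

  public static void main(String[] args) {
    long timeoutMs = 1_000L; // two containers that started recovering at the same instant
    System.out.println(trackByTimeoutOnly(timeoutMs, 1L, 2L));  // prints 1
    System.out.println(trackByTimeoutAndId(timeoutMs, 1L, 2L)); // prints 2
  }
}
```

With the timeout-only key, container 1 is no longer in the map after container 2 is added, which matches the "silently dropped" behavior in the quote. Any key that also includes the container ID would avoid the collision; the set-per-timeout layout is just one way to sketch that.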
> Race condition between entries of ContainerSet#recoveringContainerMap
> ---------------------------------------------------------------------
>
> Key: HDDS-12669
> URL: https://issues.apache.org/jira/browse/HDDS-12669
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Attila Doroszlai
> Assignee: Chung-En Lee
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.3.0
>
>
> {code}
> at org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:152)
> at org.apache.hadoop.ozone.container.TestECContainerRecovery.testECContainerRecoveryWithTimedOutRecovery(TestECContainerRecovery.java:350)
> {code}
> {code:title=https://github.com/apache/ozone/blob/ebf5cc662ba6055fb8a18ad9036cbeabd3c49c29/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestECContainerRecovery.java#L350-L351}
> GenericTestUtils.waitFor(() -> reconstructedDN.get() != null, 10000,
>     100000);
> {code}
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/01/11/35926/it-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/02/19/36861/it-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/03/17/37636/integration-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/03/22/37865/integration-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt