[ 
https://issues.apache.org/jira/browse/HDDS-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088205#comment-18088205
 ] 

Attila Doroszlai commented on HDDS-12669:
-----------------------------------------

This turned out to be a real bug:

bq. Currently in ContainerSet.java, recoveringContainerMap records recovering 
containers and identifies them by their timeout values. However, this 
introduces an issue: if two or more containers start recovering at the exact 
same time, they will have identical timeout values. Because it's a map, the 
newer entry overwrites the older one. As a result, the overwritten container is 
silently dropped from the tracking map. If the actual recovery action for this 
untracked container gets stuck, the StaleRecoveringContainerScrubbingService 
will be unaware of it and cannot trigger the timeout cleanup. Consequently, the 
container becomes permanently orphaned and stuck in the 'recovering' state.
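
The overwrite described above can be reproduced in isolation. The following is a minimal sketch, not the Ozone code: it assumes a sorted map from timeout (in milliseconds) to container ID, and the class and method names are hypothetical. The second method shows one possible way to avoid the collision, keying by container ID instead; that is an illustration only and not necessarily how the actual fix works.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical demo class; none of these names exist in Ozone.
public class RecoveringMapCollisionDemo {

  /** Tracks containers keyed by timeout only; returns how many stay tracked. */
  public static int trackedWhenKeyedByTimeout(long timeoutMillis,
      long... containerIds) {
    Map<Long, Long> byTimeout = new ConcurrentSkipListMap<>();
    for (long id : containerIds) {
      // Same key every time: each put replaces the previous container ID.
      byTimeout.put(timeoutMillis, id);
    }
    return byTimeout.size();
  }

  /** Same input, keyed by container ID with the timeout as the value. */
  public static int trackedWhenKeyedByContainerId(long timeoutMillis,
      long... containerIds) {
    Map<Long, Long> byContainerId = new ConcurrentSkipListMap<>();
    for (long id : containerIds) {
      // Container IDs are distinct, so equal timeouts no longer collide.
      byContainerId.put(id, timeoutMillis);
    }
    return byContainerId.size();
  }

  public static void main(String[] args) {
    long sameTimeout = 1_000L;
    // Two containers with an identical timeout value.
    System.out.println(trackedWhenKeyedByTimeout(sameTimeout, 1L, 2L));
    System.out.println(trackedWhenKeyedByContainerId(sameTimeout, 1L, 2L));
  }
}
```

With the timeout as the key, only one of the two containers remains in the map, so a scrubber iterating over it never sees the other one.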

> Race condition between entries of ContainerSet#recoveringContainerMap
> ---------------------------------------------------------------------
>
>                 Key: HDDS-12669
>                 URL: https://issues.apache.org/jira/browse/HDDS-12669
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Attila Doroszlai
>            Assignee: Chung-En Lee
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.3.0
>
>
> {code}
>       at org.apache.ozone.test.GenericTestUtils.waitFor(GenericTestUtils.java:152)
>       at org.apache.hadoop.ozone.container.TestECContainerRecovery.testECContainerRecoveryWithTimedOutRecovery(TestECContainerRecovery.java:350)
> {code}
> {code:title=https://github.com/apache/ozone/blob/ebf5cc662ba6055fb8a18ad9036cbeabd3c49c29/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/container/TestECContainerRecovery.java#L350-L351}
>     GenericTestUtils.waitFor(() -> reconstructedDN.get() != null, 10000,
>             100000);
> {code}
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/01/11/35926/it-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/02/19/36861/it-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/03/17/37636/integration-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt
> * https://github.com/adoroszlai/ozone-build-results/blob/master/2025/03/22/37865/integration-container/hadoop-ozone/integration-test/org.apache.hadoop.ozone.container.TestECContainerRecovery.txt



