[jira] [Assigned] (HDDS-15651) [DiskBalancer] markContainerForDelete failure is treated as a successful move

Gargi Jaiswal (Jira) Wed, 24 Jun 2026 03:16:18 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gargi Jaiswal reassigned HDDS-15651:
------------------------------------

    Assignee: Gargi Jaiswal

> [DiskBalancer] markContainerForDelete failure is treated as a successful move
> -----------------------------------------------------------------------------
>
>                 Key: HDDS-15651
>                 URL: https://issues.apache.org/jira/browse/HDDS-15651
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Arun Sarin
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: repro_HDDS_markContainerForDelete_BEFORE_fix.log
>
>
> During a DiskBalancer container move, the datanode copies the container to the
> destination volume, updates ContainerSet to point to the new replica, and then
> calls markContainerForDelete() on the old source replica.
>  
> If markContainerForDelete() fails, DiskBalancer still treats the move as
> successful. Success metrics are updated and the old replica may be queued for
> delayed deletion, even though the source replica was never properly marked
> DELETED. This can leave duplicate replicas on disk and make disk usage and
> balancer status misleading.
>  
> While reviewing DiskBalancerService.DiskBalancerTask.call(), I noticed that
> moveSucceeded is set to true before markContainerForDelete() is called. If 
> mark
> fails, the error is only logged and the move is still counted as success.
>  
> I added a unit test to reproduce this:
> TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure
>  
> The test simulates a markContainerForDelete() failure and checks that the move
> should be reported as failed, with no duplicate replica left active on the
> destination.
>  
> *Steps to reproduce*
> 1. Run the unit test:
>  
> mvn test -pl hadoop-hdds/container-service -am \
> -Dtest=TestDiskBalancerTask#moveSucceedsDespiteMarkContainerForDeleteFailure \
> -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
>  
> 2. On current master (before fix), the test fails with 8 failures (one per
> container schema variant). Example:
>  
> successCount should be 0 when markContainerForDelete fails
> expected: 0 but was: 1
>  
> *Expected behavior*
> - The move should be counted as a failure (failureCount increases, 
> successCount
> stays 0).
> - ContainerSet should keep the source replica as the active one.
> - Source and destination volume used space should not reflect a completed 
> move.
> - Any partially created destination replica should be cleaned up.
>  
> *Actual behavior*
> - successCount is incremented and successBytes is updated.
> - Log message: "Failed to mark the old container <id> for delete. It will be
> handled after DN restart."
> - ContainerSet points to the new replica on the destination volume.
> - The old source replica directory still exists on disk.
> - The old replica is still queued for delayed deletion.
>  
> *Impact*
> - Operators see a successful move in DiskBalancer metrics when it did not
> fully complete.
> - Duplicate replicas can consume extra disk space on the datanode.
> - Source volume may stay over-utilized and balancing may not progress as
> expected until datanode restart.
>  
> *Suggested fix*
> Only mark the move as successful after markContainerForDelete() succeeds. If
> mark fails, roll back the move: restore ContainerSet to the source replica,
> revert destination volume used space, and delete the destination replica
> directory. Do not update success metrics or queue the old replica for 
> deletion.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Assigned] (HDDS-15651) [DiskBalancer] markContainerForDelete failure is treated as a successful move

Reply via email to