[jira] [Updated] (HDDS-15524) [DiskBalancer] Container parallel moves can overwrite pending source replica deletions

ASF GitHub Bot (Jira) Wed, 10 Jun 2026 23:06:08 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HDDS-15524:
----------------------------------
    Labels: pull-request-available  (was: )

> [DiskBalancer] Container parallel moves can overwrite pending source replica 
> deletions
> --------------------------------------------------------------------------------------
>
>                 Key: HDDS-15524
>                 URL: https://issues.apache.org/jira/browse/HDDS-15524
>             Project: Apache Ozone
>          Issue Type: Sub-task
>    Affects Versions: 2.2.0
>            Reporter: Gargi Jaiswal
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
>
> In {{DiskBalancerService.java}} *at lines 642-643:*
>  
> {code:java}
> pendingDeletionContainers.put(clock.millis() + replicaDeletionDelay, 
> container);{code}
> {quote}After a successful container move, the old replica is queued for 
> deletion in {{{}pendingDeletionContainers{}}}, keyed by {{{}clock.millis() + 
> replicaDeletionDelay{}}}. That key has only millisecond precision, so if two 
> moves finish in the same millisecond, they get the same key.
>  Because the {*}map stores one container per key{*}, the second {{put()}} 
> overwrites the first. The overwritten container is never scheduled for 
> deletion, so its old replica stays on disk and wastes space. With 
> {{{}parallelThread > 1{}}}, this is realistic under normal load.{quote}
> The key is: *{color:#172b4d}{{clock.millis() + replicaDeletionDelay}}{color}*
> Both parts are the same for every thread finishing at the same millisecond:
>  * *{{clock.millis()}}*: wall clock, millisecond resolution. All JVM 
> threads share the same clock.
>  * *{{replicaDeletionDelay}}*: a single constant (default 5 minutes 
> = 300,000 ms) shared by the whole service.
> *Step-by-step analysis*
> Assume {color:#00875a}{{replicaDeletionDelay = 300,000 ms}}{color} 
> and {color:#00875a}{{parallelThread = 5}}{color}.
> Five container moves run in parallel. Moves for containers C-101 and C-202 
> both finish at {color:#00875a}{{clock.millis() = 1,000,000}}{color}:
>  
> {code:java}
> Thread-1 (moving C-101):
> key = 1,000,000 + 300,000 = 1,300,000
> pendingDeletionContainers.put(1_300_000, C-101_old_replica)
> Map now: { 1_300_000 → C-101_old }
>  
> Thread-2 (moving C-202), same millisecond:
> key = 1,000,000 + 300,000 = 1,300,000           <------ identical key! 
> pendingDeletionContainers.put(1_300_000, C-202_old_replica)
> Map now: { 1_300_000 → C-202_old }     <--------- C-101_old silently GONE{code}
> C-101's old replica has been permanently lost from the pending-deletion 
> queue. It will never be scheduled for deletion.
>  * C-202's old replica gets deleted correctly.
>  * C-101's old replica is never visited. It sits on the source disk forever.
>  * The container is marked {{DELETED}} in metadata (line 605), so it won't 
> be served.
>  * But its data directory (chunks, {{container.db}}, {{.container}} 
> descriptor) remains on the source disk.
>  * {{decrementUsedSpace}} is only called inside {{deleteContainer()}}, 
> so the source volume's used-space counter is never corrected.
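
The collision described above can be reproduced in isolation with a fixed clock. The sketch below is illustrative only and is not code from DiskBalancerService: the class name, the method names, the use of a ConcurrentHashMap, and the String placeholders standing in for container objects are all assumptions. The second method shows one possible way to keep every replica that shares a deadline; it is not necessarily the approach taken in the linked pull request.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PendingDeletionKeyCollision {
    // Default delay quoted in the report: 5 minutes = 300,000 ms.
    static final long REPLICA_DELETION_DELAY_MS = 300_000L;

    // Mirrors the reported pattern: the deadline is the key, one value per key.
    // (Map type and String values are assumptions made for this sketch.)
    public static Map<Long, String> queueOnePerKey(Clock clock, List<String> replicas) {
        Map<Long, String> pending = new ConcurrentHashMap<>();
        for (String replica : replicas) {
            // A second put() with the same deadline replaces the earlier entry.
            pending.put(clock.millis() + REPLICA_DELETION_DELAY_MS, replica);
        }
        return pending;
    }

    // Hypothetical alternative: each deadline maps to a queue of replicas,
    // so entries that share a millisecond are all retained.
    public static Map<Long, Queue<String>> queueManyPerKey(Clock clock, List<String> replicas) {
        Map<Long, Queue<String>> pending = new ConcurrentHashMap<>();
        for (String replica : replicas) {
            pending.computeIfAbsent(clock.millis() + REPLICA_DELETION_DELAY_MS,
                    key -> new ConcurrentLinkedQueue<>()).add(replica);
        }
        return pending;
    }

    public static void main(String[] args) {
        // A fixed clock stands in for two moves finishing in the same millisecond.
        Clock fixed = Clock.fixed(Instant.ofEpochMilli(1_000_000L), ZoneOffset.UTC);
        List<String> replicas = List.of("C-101_old", "C-202_old");

        System.out.println(queueOnePerKey(fixed, replicas));   // {1300000=C-202_old}
        System.out.println(queueManyPerKey(fixed, replicas));  // {1300000=[C-101_old, C-202_old]}
    }
}
```

The fixed clock makes the run deterministic and single-threaded; in the real service the same overwrite would require two threads to read the same millisecond, which the report argues is realistic when parallelThread > 1.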



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
