Gargi Jaiswal created HDDS-15524:
------------------------------------

             Summary: [DiskBalancer] Container parallel moves can overwrite 
pending source replica deletions
                 Key: HDDS-15524
                 URL: https://issues.apache.org/jira/browse/HDDS-15524
             Project: Apache Ozone
          Issue Type: Sub-task
    Affects Versions: 2.2.0
            Reporter: Gargi Jaiswal
            Assignee: Gargi Jaiswal


In {{DiskBalancerService.java}} *at lines 642-643:*
 
{code:java}
pendingDeletionContainers.put(clock.millis() + replicaDeletionDelay, container);
{code}



{quote}After a successful container move, the old replica is queued for
deletion in {{pendingDeletionContainers}}, keyed by
{{clock.millis() + replicaDeletionDelay}}. That key has only millisecond
precision, so two moves that finish in the same millisecond get the same key.
Because the *map stores one container per key*, the second {{put()}}
overwrites the first. The overwritten container is never scheduled for
deletion, so its old replica stays on disk and wastes space. With
{{parallelThread > 1}}, this is realistic under normal load.{quote}


The key is: *{{clock.millis() + replicaDeletionDelay}}*
Both parts are the same for every thread finishing in the same millisecond:
 * *{{clock.millis()}}*: wall clock, millisecond resolution. All JVM threads
share the same clock.
 * *{{replicaDeletionDelay}}*: a single constant (default 5 minutes =
300,000 ms) shared by the whole service.
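
The collision can be reproduced in isolation. The following is a minimal sketch, not the service code: the class name {{PendingDeletionCollisionDemo}}, the {{queueAll}} helper, the use of {{String}} values in place of container objects, and the choice of {{ConcurrentHashMap}} are all assumptions made for illustration. A fixed {{java.time.Clock}} stands in for two threads reading the clock in the same millisecond.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingDeletionCollisionDemo {
  // Default delay quoted in this issue: 5 minutes.
  static final long REPLICA_DELETION_DELAY_MS = 300_000L;

  /**
   * Queues each old replica using the same keying scheme as the issue:
   * clock.millis() + delay. Strings stand in for container objects.
   */
  static Map<Long, String> queueAll(Clock clock, String... oldReplicas) {
    Map<Long, String> pending = new ConcurrentHashMap<>();
    for (String replica : oldReplicas) {
      // Same clock reading + same constant = same key, so put() overwrites.
      pending.put(clock.millis() + REPLICA_DELETION_DELAY_MS, replica);
    }
    return pending;
  }

  public static void main(String[] args) {
    // A fixed clock models two moves finishing in the same millisecond.
    Clock frozen = Clock.fixed(Instant.ofEpochMilli(1_000_000L), ZoneOffset.UTC);
    Map<Long, String> pending = queueAll(frozen, "C-101_old", "C-202_old");
    System.out.println(pending); // prints {1300000=C-202_old}
  }
}
```

Only one entry survives even though two replicas were queued, which matches the behaviour described above.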



*Step-by-step analysis*
Assume {color:#00875a}{{replicaDeletionDelay = 300,000 ms}}{color}
and {color:#00875a}{{parallelThread = 5}}{color}.
Five container moves run in parallel. Moves for containers C-101 and C-202 both 
finish at {color:#00875a}{{clock.millis() = 1,000,000}}{color}:
 
{code:java}
Thread-1 (moving C-101):
key = 1,000,000 + 300,000 = 1,300,000
pendingDeletionContainers.put(1_300_000, C-101_old_replica)
Map now: { 1_300_000 → C-101_old }
 
Thread-2 (moving C-202), same millisecond:
key = 1,000,000 + 300,000 = 1,300,000           <------ identical key! 
pendingDeletionContainers.put(1_300_000, C-202_old_replica)
Map now: { 1_300_000 → C-202_old }     <--------- C-101_old silently GONE{code}
C-101's old replica has been permanently lost from the pending-deletion queue.
It will never be scheduled for deletion.
 * C-202's old replica gets deleted correctly.
 * C-101's old replica is never visited. It sits on the source disk forever.
 * The container is marked {{DELETED}} in metadata (line 605), so it won't
be served.
 * But its data directory (chunks, {{container.db}}, {{.container}}
descriptor) remains on the source disk.
 * {{decrementUsedSpace}} is only called inside {{deleteContainer()}}, so
the source volume's used-space counter is never corrected.
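
One possible way to avoid the overwrite, shown here only as a hedged sketch and not as the fix adopted for this issue, is to let each timestamp key hold a collection of replicas instead of a single one. The class {{PendingDeletionBuckets}} and its methods {{enqueue}} and {{count}} are hypothetical names, and {{String}} again stands in for the container type.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class PendingDeletionBuckets {
  // Each deletion timestamp maps to every replica due at that time.
  private final Map<Long, Queue<String>> pending = new ConcurrentHashMap<>();

  /** Adds a replica under its deletion time without replacing earlier entries. */
  void enqueue(long deleteAtMillis, String oldReplica) {
    // computeIfAbsent creates the bucket atomically on first use;
    // later callers with the same key append to the same queue.
    pending.computeIfAbsent(deleteAtMillis, k -> new ConcurrentLinkedQueue<>())
        .add(oldReplica);
  }

  /** Total number of replicas currently queued across all timestamps. */
  int count() {
    return pending.values().stream().mapToInt(Queue::size).sum();
  }
}
```

With this shape, queuing C-101 and C-202 under key 1,300,000 leaves both in the queue. A real change would also need to handle draining and removing buckets safely while other threads are still adding to them, which this sketch does not attempt.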



