[
https://issues.apache.org/jira/browse/HDDS-15524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-15524:
----------------------------------
Labels: pull-request-available (was: )
> [DiskBalancer] Container parallel moves can overwrite pending source replica
> deletions
> --------------------------------------------------------------------------------------
>
> Key: HDDS-15524
> URL: https://issues.apache.org/jira/browse/HDDS-15524
> Project: Apache Ozone
> Issue Type: Sub-task
> Affects Versions: 2.2.0
> Reporter: Gargi Jaiswal
> Assignee: Gargi Jaiswal
> Priority: Major
> Labels: pull-request-available
>
> In {{DiskBalancerService.java}}, *at lines 642-643:*
>
> {code:java}
> pendingDeletionContainers.put(clock.millis() + replicaDeletionDelay,
> container);{code}
> {quote}After a successful container move, the old replica is queued for
> deletion in {{pendingDeletionContainers}}, keyed by {{clock.millis() +
> replicaDeletionDelay}}. That key has only millisecond precision, so two
> moves that finish in the same millisecond get the same key.
> Because the *map stores one container per key*, the second {{put()}}
> overwrites the first. The overwritten container is never scheduled for
> deletion, so its old replica stays on disk and wastes space. With
> {{parallelThread > 1}}, this is realistic under normal load.{quote}
> The key is: *{{clock.millis() + replicaDeletionDelay}}*
> Both parts are the same for every thread finishing in the same millisecond:
> * *{{clock.millis()}}*: wall clock, millisecond resolution. All JVM
> threads share the same clock.
> * *{{replicaDeletionDelay}}*: a single constant (default 5 minutes
> = 300,000 ms) shared by the whole service.
> *Step-by-step analysis*
> Assume {{replicaDeletionDelay = 300,000 ms}} and {{parallelThread = 5}}.
> Five container moves run in parallel. Moves for containers C-101 and C-202
> both finish at {{clock.millis() = 1,000,000}}:
>
> {code:java}
> Thread-1 (moving C-101):
> key = 1,000,000 + 300,000 = 1,300,000
> pendingDeletionContainers.put(1_300_000, C-101_old_replica)
> Map now: { 1_300_000 → C-101_old }
>
> Thread-2 (moving C-202), same millisecond:
> key = 1,000,000 + 300,000 = 1,300,000 <------ identical key!
> pendingDeletionContainers.put(1_300_000, C-202_old_replica)
> Map now: { 1_300_000 → C-202_old } <--------- C-101_old silently GONE
> {code}
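
The trace above can be reproduced in isolation. The following is a minimal sketch, not the Ozone code: it assumes the pending map behaves like a plain {{ConcurrentHashMap<Long, String>}} (the issue does not state the actual map type), uses strings in place of container objects, and freezes the clock so both puts land in the same millisecond. {{PendingDeletionCollision}} and {{queueAll}} are hypothetical names.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingDeletionCollision {
  // Assumed value, taken from the worked example (5 minutes).
  static final long REPLICA_DELETION_DELAY_MS = 300_000L;

  /**
   * Queues every replica under clock.millis() + delay, the same keying
   * scheme described in the issue, and returns the resulting map.
   */
  static Map<Long, String> queueAll(Clock clock, String... replicas) {
    Map<Long, String> pending = new ConcurrentHashMap<>();
    for (String replica : replicas) {
      // Same millisecond means same key, so a later put replaces an earlier one.
      pending.put(clock.millis() + REPLICA_DELETION_DELAY_MS, replica);
    }
    return pending;
  }

  public static void main(String[] args) {
    // Frozen clock: every call to millis() returns 1,000,000.
    Clock frozen = Clock.fixed(Instant.ofEpochMilli(1_000_000L), ZoneOffset.UTC);
    Map<Long, String> pending = queueAll(frozen, "C-101_old", "C-202_old");
    System.out.println(pending); // prints {1300000=C-202_old}
  }
}
```

Only one entry survives even though two replicas were queued, which matches the final map state in the trace.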
> C-101's old replica has been permanently lost from the pending-deletion
> queue and will never be scheduled for deletion:
> * C-202's old replica gets deleted correctly.
> * C-101's old replica is never visited. It sits on the source disk forever.
> * The container is marked {{DELETED}} in metadata (line 605), so it won't
> be served.
> * But its data directory (chunks, {{container.db}}, {{.container}}
> descriptor) remains on the source disk.
> * {{decrementUsedSpace}} is only called inside {{deleteContainer()}},
> so the source volume's used-space counter is never corrected.
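
One way to avoid the collision is to key the pending entries by something unique per container rather than by the deadline. The sketch below is an illustration only and is not necessarily what the linked pull request does: it assumes a numeric container ID is available, and {{PendingReplicaDeletions}}, {{schedule}} and {{drainDue}} are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingReplicaDeletions {
  // Hypothetical structure: container ID -> earliest deletion time (epoch millis).
  // Two containers can share a deadline because the deadline is a value, not a key.
  private final Map<Long, Long> deadlineByContainerId = new ConcurrentHashMap<>();

  /** Records that the old replica of containerId may be deleted at deadlineMillis. */
  void schedule(long containerId, long deadlineMillis) {
    deadlineByContainerId.put(containerId, deadlineMillis);
  }

  /** Removes and returns the IDs of all containers whose deadline has passed. */
  List<Long> drainDue(long nowMillis) {
    List<Long> due = new ArrayList<>();
    for (Map.Entry<Long, Long> e : deadlineByContainerId.entrySet()) {
      // Conditional remove: only claim the entry if it still holds this deadline.
      if (e.getValue() <= nowMillis
          && deadlineByContainerId.remove(e.getKey(), e.getValue())) {
        due.add(e.getKey());
      }
    }
    return due;
  }

  int size() {
    return deadlineByContainerId.size();
  }

  public static void main(String[] args) {
    PendingReplicaDeletions pending = new PendingReplicaDeletions();
    // Both moves finish in the same millisecond and get the same deadline.
    pending.schedule(101L, 1_300_000L);
    pending.schedule(202L, 1_300_000L);
    System.out.println(pending.size()); // prints 2
  }
}
```

The trade-off in this sketch is that entries are no longer ordered by deadline, so {{drainDue}} scans the whole map on each pass. A structure that keeps the time ordering and stores several containers per deadline would be another option.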