[
https://issues.apache.org/jira/browse/HDDS-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gargi Jaiswal resolved HDDS-15066.
----------------------------------
Fix Version/s: 2.0.0
Resolution: Fixed
> Read-Write Lock race leave stale references to container creating orphan
> replicas
> ---------------------------------------------------------------------------------
>
> Key: HDDS-15066
> URL: https://issues.apache.org/jira/browse/HDDS-15066
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Gargi Jaiswal
> Assignee: Gargi Jaiswal
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
>
> On a datanode, some work runs under a container {color:#de350b}read
> lock{color} (or otherwise changes which replica directory / in-memory
> {{Container}} is authoritative) while other threads look up a container once
> and later take a {color:#de350b}write lock{color}. If the in-memory container
> mapping or on-disk location changes in between, the second thread can still
> use a stale *{{Container}}* or *{{KeyValueContainerData}}* reference. That is
> a classic *TOCTOU* problem: the map and the caller disagree about where the
> replica lives.
> Worst cases include wrong {{*ContainerSet*}} updates, deleting or updating
> the wrong paths, ghost / orphan replica data on disk that SCM no longer
> tracks, and block deletion targeting the wrong RocksDB/chunks tree so pending
> deletes on the live replica are never applied (space not reclaimed).
> Applies to CLOSED and QUASI_CLOSED (and any path where balancing/replication
> overlaps lifecycle commands), not a single state.
> *This can be easily explained with help of DiskBalancer as an example,
> although it applies to other read write races as well:*
> While {color:#de350b}DiskBalance{color}r moves a container between volumes on
> the same datanode, it {*}holds the container read lock{*}, copies data, then
> calls *{{ContainerSet.updateContainer(...)}}* to point the *in-memory map* at
> a new {{Container}} instance (new volume / paths).
> Whereas it is seen that other threads often look up the {*}container once{*},
> then block {color:#de350b}on {{writeLock()}}{color} until the move finishes.
> After they unblock, they still hold a reference to the old container (source
> volume). They then run logic and/or *{{removeContainer(containerId)}}* using
> stale paths and stale object identity.
> *Impacts:*
> * *Replication Manager {{{}DeleteContainerCommand{}}}:* can remove the wrong
> map entry (live replica on the destination volume), delete/move source files,
> and leave an orphan replica on disk that SCM no longer sees — plus wrong
> volume accounting and ICRs keyed off the old replica.
> * *{{{}BlockDeletingTask{}}}:* uses a task-scoped
> *{{KeyValueContainerData}}* snapshot (source paths) while {{getContainer()}}
> may return the new replica; block cleanup can target the wrong chunks path
> and never clear pending-deletion state on the live replica → space not
> reclaimed.
> *Suggested Fix:*
> For RM side: After acquiring writeLock(), {*}re-fetch from the map and
> compare by identity{*}. If they differ, the container was moved — abort and
> let the caller retry.
> {code:java}
> container.writeLock();
> try {
> Container<?> current containerSet.getContainer(
> container.getContainerData().getContainerID());
> if (current != container) {
>
> // Container was relocated by DiskBalancer while we waited for the lock.Our
> reference is stale; the operation will be retried on the next cycle.
>
> return;
> }
> // ... proceed with the actual operation
> }{code}
> For BlockDeletingService:
> {code:java}
> container.writeLock();
> // Re-fetch AFTER lock: map already has disk-B in scenarios.
> // current.getContainerData() is disk-B's ContainerData ≠ this.containerData
> (disk-A snapshot).
> Container<?> current = containerSet.getContainer(containerData
> .getContainerID());
> if (current == null || current.getContainerData() != containerData) {
> // containerData is stale — DiskBalancer relocated this container.
> BlockDeletingService will reschedule with disk-B's fresh containerData.
> return;
> } {code}
> For DiskBalancer:
> Move the container state check after acquiring
> {color:#de350b}readLock(){color} to prevent stale references about container.
> {code:java}
> container.readLock();
> try {
> // Double check container state before acquiring lock to start move process.
> // Container state may have changed after selection.
> State containerState = container.getContainerData().getState();
> if (!movableContainerStates.contains(containerState)) {
> LOG.warn("Container {} is in {} state, skipping move process.",
> containerId, containerState);
> postCall(false, startTime);
> return BackgroundTaskResult.EmptyTaskResult.newResult();
> } // Step 1: Copy container to new Volume's tmp Dir
> diskBalancerTmpDir = getDiskBalancerTmpDir(destVolume)
> .resolve(String.valueOf(containerId));
> ozoneContainer.getController().copyContainer(containerData,
> diskBalancerTmpDir);
> ....{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]