[ 
https://issues.apache.org/jira/browse/HDDS-15066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gargi Jaiswal resolved HDDS-15066.
----------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

> Read-Write Lock race leave stale references to container creating orphan 
> replicas
> ---------------------------------------------------------------------------------
>
>                 Key: HDDS-15066
>                 URL: https://issues.apache.org/jira/browse/HDDS-15066
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Gargi Jaiswal
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>
> On a datanode, some work runs under a container {color:#de350b}read 
> lock{color} (or otherwise changes which replica directory / in-memory 
> {{Container}} is authoritative) while other threads look up a container once 
> and later take a {color:#de350b}write lock{color}. If the in-memory container 
> mapping or on-disk location changes in between, the second thread can still 
> use a stale *{{Container}}* or *{{KeyValueContainerData}}* reference. That is 
> a classic *TOCTOU* problem: the map and the caller disagree about where the 
> replica lives.
> Worst cases include wrong {{*ContainerSet*}} updates, deleting or updating 
> the wrong paths, ghost / orphan replica data on disk that SCM no longer 
> tracks, and block deletion targeting the wrong RocksDB/chunks tree so pending 
> deletes on the live replica are never applied (space not reclaimed).
> Applies to CLOSED and QUASI_CLOSED (and any path where balancing/replication 
> overlaps lifecycle commands), not a single state.
> *This can be easily explained with help of DiskBalancer as an example, 
> although it applies to other read write races as well:*
> While {color:#de350b}DiskBalance{color}r moves a container between volumes on 
> the same datanode, it {*}holds the container read lock{*}, copies data, then 
> calls *{{ContainerSet.updateContainer(...)}}* to point the *in-memory map* at 
> a new {{Container}} instance (new volume / paths).
> Whereas it is seen that other threads often look up the {*}container once{*}, 
> then block {color:#de350b}on {{writeLock()}}{color} until the move finishes. 
> After they unblock, they still hold a reference to the old container (source 
> volume). They then run logic and/or *{{removeContainer(containerId)}}* using 
> stale paths and stale object identity.
> *Impacts:*
>  * *Replication Manager {{{}DeleteContainerCommand{}}}:* can remove the wrong 
> map entry (live replica on the destination volume), delete/move source files, 
> and leave an orphan replica on disk that SCM no longer sees — plus wrong 
> volume accounting and ICRs keyed off the old replica.
>  * *{{{}BlockDeletingTask{}}}:* uses a task-scoped 
> *{{KeyValueContainerData}}* snapshot (source paths) while {{getContainer()}} 
> may return the new replica; block cleanup can target the wrong chunks path 
> and never clear pending-deletion state on the live replica → space not 
> reclaimed.
> *Suggested Fix:*
> For RM side: After acquiring writeLock(), {*}re-fetch from the map and 
> compare by identity{*}. If they differ, the container was moved — abort and 
> let the caller retry.
> {code:java}
> container.writeLock();
> try {    
> Container<?> current containerSet.getContainer( 
> container.getContainerData().getContainerID());
> if (current != container) { 
>        
> // Container was relocated by DiskBalancer while we waited for the lock.Our 
> reference is stale; the operation will be retried on the next cycle. 
>       
>  return;    
>  } 
>  // ... proceed with the actual operation
> }{code}
> For BlockDeletingService:
> {code:java}
> container.writeLock();
> // Re-fetch AFTER lock: map already has disk-B in scenarios.
> // current.getContainerData() is disk-B's ContainerData ≠ this.containerData 
> (disk-A snapshot).
> Container<?> current = containerSet.getContainer(containerData
> .getContainerID());
> if (current == null || current.getContainerData() != containerData) {   
> // containerData is stale — DiskBalancer relocated this container. 
> BlockDeletingService will reschedule with disk-B's fresh containerData.
> return;
> } {code}
> For DiskBalancer:
> Move the container state check after acquiring 
> {color:#de350b}readLock(){color} to prevent stale references about container.
> {code:java}
> container.readLock();
> try {
> // Double check container state before acquiring lock to start move process.
> // Container state may have changed after selection.
> State containerState = container.getContainerData().getState();
> if (!movableContainerStates.contains(containerState)) {
>   LOG.warn("Container {} is in {} state, skipping move process.", 
> containerId, containerState);
>   postCall(false, startTime);
>   return BackgroundTaskResult.EmptyTaskResult.newResult();
> }   // Step 1: Copy container to new Volume's tmp Dir
>   diskBalancerTmpDir = getDiskBalancerTmpDir(destVolume)
>       .resolve(String.valueOf(containerId));
>   ozoneContainer.getController().copyContainer(containerData, 
> diskBalancerTmpDir); 
> ....{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to