[ 
https://issues.apache.org/jira/browse/HDDS-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarveksha Yeshavantha Raju resolved HDDS-13698.
-----------------------------------------------
    Resolution: Duplicate

> Race condition in Container Balancer start/stop HA flow
> -------------------------------------------------------
>
>                 Key: HDDS-13698
>                 URL: https://issues.apache.org/jira/browse/HDDS-13698
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 2.0.0
>            Reporter: Siddhant Sangwan
>            Assignee: Sarveksha Yeshavantha Raju
>            Priority: Major
>
> 1. When a leader steps down it stops the balancer thread locally but does not 
> flip the persisted flag (shouldRun stays true).
> Result: task != null, taskStatus == STOPPED.
> 2. When that SCM later regains leadership its notifyStatusChanged() thread 
> reads shouldRun = true and – because taskStatus == STOPPED – starts a new 
> balancer thread.
> 3. If, in the same time-window, an administrator issues stopBalancer from the 
> CLI, that method
>  - acquires the same lock first,
>  - calls validateState(true) which expects the balancer to be RUNNING,
>  - finds it STOPPED and throws an exception before persisting shouldRun = 
> false.
> 4. The command fails without persisting shouldRun = false, and the balancer 
> continues to run when it should have stopped.
> h2. Proposed fix:
> Split the current validateState(boolean expectedRunning) into two methods and 
> update stopBalancer():
> 1. validateEligibility() – checks leader-ready and safe-mode only.
> 2. validateState(expectedRunning) – delegates to validateEligibility() and 
> then performs the running / stopped assertions.
> 3. Change stopBalancer() to call validateEligibility() instead of 
> validateState(true), persist shouldRun = false before looking at taskStatus, 
> and then interrupt a running task if present.
> Roughly what the changes look like in code:
> {code:java}
> private void validateEligibility()
>     throws IllegalContainerBalancerStateException {
>   if (!scmContext.isLeaderReady()) {
>     LOG.warn("SCM is not leader ready");
>     throw new IllegalContainerBalancerStateException("SCM is not leader " +
>           "ready");
>   }
>   if (scmContext.isInSafeMode()) {
>     LOG.warn("SCM is in safe mode");
>     throw new IllegalContainerBalancerStateException("SCM is in safe mode");
>   }
> }
> private void validateState(boolean expectedRunning)
>     throws IllegalContainerBalancerStateException {
>   validateEligibility();
>   if (!expectedRunning && !canBalancerStart()) {
>     ...
>   }
>   if (expectedRunning && !canBalancerStop()) {
>     ...
>   }
> }
> public void stopBalancer()
>     throws IOException, IllegalContainerBalancerStateException {
>   Thread balancingThread = null;
>   lock.lock();
>   try {
>     validateEligibility();               // only leadership / safemode
>     saveConfiguration(config, false, 0);
>     
>     if (isBalancerRunning()) {
>       LOG.info("Trying to stop ContainerBalancer service.");
>       task.stop();
>       balancingThread = currentBalancingThread;
>     }
>   } finally {
>     lock.unlock();
>   }
>   if (balancingThread != null) {
>     blockTillTaskStop(balancingThread);
>   }
> }
> {code}
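
The sequence described in the issue can be reproduced with a small standalone model. The sketch below is not the Ozone ContainerBalancer: the class BalancerStopSketch, its fields, and the methods stopBalancerOld(), stopBalancerNew() and onLeadershipRegained() are hypothetical stand-ins, and IllegalStateException replaces IllegalContainerBalancerStateException so the example has no dependencies. It assumes the state after a leader step-down (shouldRun persisted as true, task STOPPED) and compares what a stop request leaves behind under each validation order.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of the flow in this issue; not Ozone code.
class BalancerStopSketch {
  enum Status { RUNNING, STOPPED }

  private final ReentrantLock lock = new ReentrantLock();
  boolean leaderReady = true;         // stands in for scmContext.isLeaderReady()
  boolean inSafeMode = false;         // stands in for scmContext.isInSafeMode()
  boolean persistedShouldRun = true;  // left true by the earlier step-down
  Status taskStatus = Status.STOPPED; // thread was stopped locally

  // Leadership and safe-mode checks only.
  private void validateEligibility() {
    if (!leaderReady) {
      throw new IllegalStateException("SCM is not leader ready");
    }
    if (inSafeMode) {
      throw new IllegalStateException("SCM is in safe mode");
    }
  }

  // Eligibility checks plus the running / stopped assertion.
  private void validateState(boolean expectedRunning) {
    validateEligibility();
    boolean running = taskStatus == Status.RUNNING;
    if (expectedRunning != running) {
      throw new IllegalStateException("Unexpected balancer status: " + taskStatus);
    }
  }

  // Current behaviour: throws before the flag is persisted if the task is STOPPED.
  void stopBalancerOld() {
    lock.lock();
    try {
      validateState(true);
      persistedShouldRun = false;
      taskStatus = Status.STOPPED;
    } finally {
      lock.unlock();
    }
  }

  // Proposed behaviour: persist the flag first, then stop a running task if any.
  void stopBalancerNew() {
    lock.lock();
    try {
      validateEligibility();
      persistedShouldRun = false;
      if (taskStatus == Status.RUNNING) {
        taskStatus = Status.STOPPED;
      }
    } finally {
      lock.unlock();
    }
  }

  // Models the restart decision made after regaining leadership.
  void onLeadershipRegained() {
    lock.lock();
    try {
      if (persistedShouldRun && taskStatus == Status.STOPPED) {
        taskStatus = Status.RUNNING;
      }
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    BalancerStopSketch oldFlow = new BalancerStopSketch();
    try {
      oldFlow.stopBalancerOld(); // admin stop wins the lock first
    } catch (IllegalStateException e) {
      System.out.println("old stop rejected: " + e.getMessage());
    }
    oldFlow.onLeadershipRegained();
    System.out.println("old flow: shouldRun=" + oldFlow.persistedShouldRun
        + ", status=" + oldFlow.taskStatus);

    BalancerStopSketch newFlow = new BalancerStopSketch();
    newFlow.stopBalancerNew();
    newFlow.onLeadershipRegained();
    System.out.println("new flow: shouldRun=" + newFlow.persistedShouldRun
        + ", status=" + newFlow.taskStatus);
  }
}
```

In this model the old flow ends with shouldRun=true and the task RUNNING, while the new flow ends with shouldRun=false and the task STOPPED. The model only covers the ordering where the stop request takes the lock before the restart; if the restart runs first, both versions find a RUNNING task and stop it.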



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

