[
https://issues.apache.org/jira/browse/HDDS-13698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sarveksha Yeshavantha Raju resolved HDDS-13698.
-----------------------------------------------
Resolution: Duplicate
> Race condition in Container Balancer start/stop HA flow
> -------------------------------------------------------
>
> Key: HDDS-13698
> URL: https://issues.apache.org/jira/browse/HDDS-13698
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Affects Versions: 2.0.0
> Reporter: Siddhant Sangwan
> Assignee: Sarveksha Yeshavantha Raju
> Priority: Major
>
> 1. When a leader steps down, it stops the balancer thread locally but does not
> flip the persisted flag (shouldRun stays true).
> Result: task != null, taskStatus == STOPPED.
> 2. When that SCM later regains leadership, its notifyStatusChanged() thread
> reads shouldRun = true and, because taskStatus == STOPPED, starts a new
> balancer thread.
> 3. If, in the same time window, an administrator issues stopBalancer from the
> CLI, that method
> - acquires the same lock first,
> - calls validateState(true), which expects the balancer to be RUNNING,
> - finds it STOPPED and throws an exception before persisting shouldRun =
> false.
> 4. The command fails without persisting shouldRun = false, so the balancer
> continues to run when it should have stopped.
> h2. Proposed fix:
> Split the current validateState(boolean expectedRunning) into two methods:
> 1. validateEligibility() – checks leader-ready and safe-mode only.
> 2. validateState(expectedRunning) – delegates to validateEligibility() and
> then performs the running / stopped assertions.
> Then change stopBalancer() to call validateEligibility() instead of
> validateState(true), persist shouldRun = false before looking at taskStatus,
> and then interrupt a running task if present.
> Roughly what the changes look like in code:
> {code:java}
> private void validateEligibility() throws
>     IllegalContainerBalancerStateException {
>   if (!scmContext.isLeaderReady()) {
>     LOG.warn("SCM is not leader ready");
>     throw new IllegalContainerBalancerStateException("SCM is not leader " +
>         "ready");
>   }
>   if (scmContext.isInSafeMode()) {
>     LOG.warn("SCM is in safe mode");
>     throw new IllegalContainerBalancerStateException("SCM is in safe mode");
>   }
> }
>
> private void validateState(boolean expectedRunning) throws
>     IllegalContainerBalancerStateException {
>   validateEligibility();
>   if (!expectedRunning && !canBalancerStart()) {
>     ...
>   }
>   if (expectedRunning && !canBalancerStop()) {
>     ...
>   }
> }
>
> public void stopBalancer()
>     throws IOException, IllegalContainerBalancerStateException {
>   Thread balancingThread = null;
>   lock.lock();
>   try {
>     validateEligibility(); // only leadership / safemode
>     saveConfiguration(config, false, 0);
>
>     if (isBalancerRunning()) {
>       LOG.info("Trying to stop ContainerBalancer service.");
>       task.stop();
>       balancingThread = currentBalancingThread;
>     }
>   } finally {
>     lock.unlock();
>   }
>   if (balancingThread != null) {
>     blockTillTaskStop(balancingThread);
>   }
> }
> {code}
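
To make the interleaving in steps 1 to 4 concrete, here is a minimal, self-contained sketch of the race and of the proposed ordering. It is a toy model, not Ozone code: the class BalancerStopSketch and its methods (afterStepDown, onLeaderReady, stopBalancerOld, stopBalancerFixed) are hypothetical names, the shouldRun field stands in for the persisted flag, IllegalStateException stands in for IllegalContainerBalancerStateException, and the leader-ready / safe-mode checks are left out. The calls run sequentially in the order described in step 3, where the stop request acquires the lock first.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the balancer start/stop state; not the real Ozone class. */
class BalancerStopSketch {
  enum TaskStatus { RUNNING, STOPPED }

  private final ReentrantLock lock = new ReentrantLock();
  // Stands in for the persisted shouldRun flag (assumption: a plain field).
  private boolean shouldRun;
  private TaskStatus taskStatus;

  private BalancerStopSketch(boolean shouldRun, TaskStatus taskStatus) {
    this.shouldRun = shouldRun;
    this.taskStatus = taskStatus;
  }

  /** State after step 1: thread stopped locally, shouldRun still true. */
  static BalancerStopSketch afterStepDown() {
    return new BalancerStopSketch(true, TaskStatus.STOPPED);
  }

  /** Models the leader-change callback of step 2. */
  void onLeaderReady() {
    lock.lock();
    try {
      if (shouldRun && taskStatus == TaskStatus.STOPPED) {
        taskStatus = TaskStatus.RUNNING;
      }
    } finally {
      lock.unlock();
    }
  }

  /** Ordering described in step 3: the state check runs before the flag is persisted. */
  void stopBalancerOld() {
    lock.lock();
    try {
      if (taskStatus != TaskStatus.RUNNING) {
        throw new IllegalStateException("balancer is not running");
      }
      shouldRun = false;
      taskStatus = TaskStatus.STOPPED;
    } finally {
      lock.unlock();
    }
  }

  /** Proposed ordering: persist shouldRun = false first, then stop a running task. */
  void stopBalancerFixed() {
    lock.lock();
    try {
      shouldRun = false;
      if (taskStatus == TaskStatus.RUNNING) {
        taskStatus = TaskStatus.STOPPED;
      }
    } finally {
      lock.unlock();
    }
  }

  boolean isRunning() {
    return taskStatus == TaskStatus.RUNNING;
  }

  boolean getShouldRun() {
    return shouldRun;
  }

  public static void main(String[] args) {
    BalancerStopSketch current = afterStepDown();
    try {
      current.stopBalancerOld();
    } catch (IllegalStateException e) {
      System.out.println("old stop rejected: " + e.getMessage());
    }
    current.onLeaderReady();
    System.out.println("old ordering, balancer running: " + current.isRunning());

    BalancerStopSketch proposed = afterStepDown();
    proposed.stopBalancerFixed();
    proposed.onLeaderReady();
    System.out.println("proposed ordering, balancer running: " + proposed.isRunning());
  }
}
```

In this model the old ordering rejects the stop request and the later leader-ready callback restarts the balancer, while the proposed ordering leaves shouldRun false so the callback has nothing to start.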