[jira] [Created] (HDDS-10799) SCMBlockDeletingService stuck in PAUSING state

Vyacheslav Tutrinov (Jira) Thu, 02 May 2024 23:48:11 -0700

Vyacheslav Tutrinov created HDDS-10799:
------------------------------------------


             Summary: SCMBlockDeletingService stuck in PAUSING state
                 Key: HDDS-10799
                 URL: https://issues.apache.org/jira/browse/HDDS-10799
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM
    Affects Versions: 1.4.0
            Reporter: Vyacheslav Tutrinov
            Assignee: Vyacheslav Tutrinov


SCM has a number of internal services that implement the 
org.apache.hadoop.hdds.scm.ha.SCMService interface. The interface has a method 
for notifying the services about changes in Raft state or in safe mode. While 
testing the block deletion service, the following unexpected behavior was observed:

* transactions are flushed to the DB (i.e. snapshots were taken)
* containers are closed
* BUT transactions aren't sent to DNs, and several million block deletion 
transactions remain unhandled

Investigation showed that the event of the SCM exiting safe mode was triggered 
multiple times, and the SCMBlockDeletingService eventually moved to the 
PAUSING state:

```java
// org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService#notifyStatusChanged
  public void notifyStatusChanged() {
    serviceLock.lock();
    try {
      if (scmContext.isLeaderReady() && !scmContext.isInSafeMode() &&
          serviceStatus != ServiceStatus.RUNNING) {
        safemodeExitMillis = clock.millis();
        serviceStatus = ServiceStatus.RUNNING;
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
    } finally {
      serviceLock.unlock();
    }
  }
```

* *1st trigger*: SCM is LEADER, SCM is NOT in safe mode, the service is NOT in 
RUNNING state -> the service transitions to RUNNING state
* *2nd trigger*: SCM is LEADER, SCM is NOT in safe mode, the service IS in 
RUNNING state (as a result of the 1st trigger) -> the condition is false, so 
the else branch moves the service to PAUSING state
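
The two triggers can be reproduced outside SCM with a minimal sketch. It is not the Ozone class and not a proposed patch: StatusSketch, notifyAsReported and notifyIdempotent are hypothetical names, the two boolean parameters stand in for scmContext.isLeaderReady() and scmContext.isInSafeMode(), System.currentTimeMillis() stands in for clock.millis(), and the initial PAUSING state is an assumption. The second method shows one possible way to make repeated notifications harmless.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the service's status handling; not the Ozone class.
class StatusSketch {
  enum ServiceStatus { RUNNING, PAUSING }

  private final Lock serviceLock = new ReentrantLock();
  // Assumption: the service starts out paused.
  private ServiceStatus serviceStatus = ServiceStatus.PAUSING;
  private long safemodeExitMillis;

  // Same branching as the reported snippet: when the service is already
  // RUNNING the condition is false, so the else branch pauses it.
  ServiceStatus notifyAsReported(boolean leaderReady, boolean inSafeMode) {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode
          && serviceStatus != ServiceStatus.RUNNING) {
        safemodeExitMillis = System.currentTimeMillis();
        serviceStatus = ServiceStatus.RUNNING;
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
      return serviceStatus;
    } finally {
      serviceLock.unlock();
    }
  }

  // One possible idempotent variant: the "should run" decision is separated
  // from the "already running" check, so a repeated notification is a no-op.
  ServiceStatus notifyIdempotent(boolean leaderReady, boolean inSafeMode) {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode) {
        if (serviceStatus != ServiceStatus.RUNNING) {
          safemodeExitMillis = System.currentTimeMillis();
          serviceStatus = ServiceStatus.RUNNING;
        }
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
      return serviceStatus;
    } finally {
      serviceLock.unlock();
    }
  }
}
```

Calling notifyAsReported(true, false) twice on a fresh instance returns RUNNING and then PAUSING, matching the two triggers above. The same two calls to notifyIdempotent return RUNNING both times, while a later call with inSafeMode set to true still pauses the service.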



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

