Vyacheslav Tutrinov created HDDS-10799:
------------------------------------------
Summary: SCMBlockDeletingService stuck in PAUSING state
Key: HDDS-10799
URL: https://issues.apache.org/jira/browse/HDDS-10799
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Affects Versions: 1.4.0
Reporter: Vyacheslav Tutrinov
Assignee: Vyacheslav Tutrinov
SCM has a number of internal services that implement the
org.apache.hadoop.hdds.scm.ha.SCMService interface. The interface has a method
for notifying the services about changes in Raft state or in safe mode. While
testing the block deletion service, the following unexpected behavior was observed:
* transactions are flushed to the DB (i.e. snapshots were taken)
* containers are closed
* BUT transactions aren't sent to DNs, which leaves several million
unprocessed block deletion transactions
Investigation showed that the event signalling that SCM exited safe mode was
triggered multiple times, and as a result the SCMBlockDeletingService was
eventually moved to the PAUSING state:
```java
// org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService#notifyStatusChanged
public void notifyStatusChanged() {
  serviceLock.lock();
  try {
    if (scmContext.isLeaderReady() && !scmContext.isInSafeMode() &&
        serviceStatus != ServiceStatus.RUNNING) {
      safemodeExitMillis = clock.millis();
      serviceStatus = ServiceStatus.RUNNING;
    } else {
      serviceStatus = ServiceStatus.PAUSING;
    }
  } finally {
    serviceLock.unlock();
  }
}
```
* *1st trigger*: SCM is LEADER, SCM is NOT in safe mode, the service is NOT in
RUNNING state -> the service transitions to RUNNING state
* *2nd trigger*: SCM is LEADER, SCM is NOT in safe mode, the service IS in
RUNNING state (as a result of the 1st trigger) -> the condition evaluates to
false, so the else branch moves the service to PAUSING state
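
The two triggers above can be reproduced with a small standalone model. This is only a sketch: FakeContext, Service, notifyAsReported and notifyIdempotent are hypothetical names invented for illustration, System.currentTimeMillis() stands in for the injected clock, and the idempotent variant is one possible way to restructure the condition, not necessarily the fix that will be committed.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of the service; not the actual Ozone classes.
class NotifyStatusSketch {
  enum ServiceStatus { RUNNING, PAUSING }

  /** Stand-in for SCMContext, reduced to the two checks used in the report. */
  static final class FakeContext {
    private final boolean leaderReady;
    private final boolean inSafeMode;

    FakeContext(boolean leaderReady, boolean inSafeMode) {
      this.leaderReady = leaderReady;
      this.inSafeMode = inSafeMode;
    }

    boolean isLeaderReady() { return leaderReady; }
    boolean isInSafeMode() { return inSafeMode; }
  }

  static final class Service {
    private final Lock serviceLock = new ReentrantLock();
    private final FakeContext scmContext;
    private ServiceStatus serviceStatus = ServiceStatus.PAUSING;
    private long safemodeExitMillis;

    Service(FakeContext scmContext) { this.scmContext = scmContext; }

    ServiceStatus status() { return serviceStatus; }
    long safemodeExitMillis() { return safemodeExitMillis; }

    /** Same branching as the method quoted in the report. */
    void notifyAsReported() {
      serviceLock.lock();
      try {
        if (scmContext.isLeaderReady() && !scmContext.isInSafeMode()
            && serviceStatus != ServiceStatus.RUNNING) {
          safemodeExitMillis = System.currentTimeMillis();
          serviceStatus = ServiceStatus.RUNNING;
        } else {
          // Also reached when the service is already RUNNING.
          serviceStatus = ServiceStatus.PAUSING;
        }
      } finally {
        serviceLock.unlock();
      }
    }

    /** Assumed alternative: pause only when SCM is not ready or is in safe mode. */
    void notifyIdempotent() {
      serviceLock.lock();
      try {
        if (scmContext.isLeaderReady() && !scmContext.isInSafeMode()) {
          if (serviceStatus != ServiceStatus.RUNNING) {
            safemodeExitMillis = System.currentTimeMillis();
            serviceStatus = ServiceStatus.RUNNING;
          }
          // Already RUNNING: a repeated notification changes nothing.
        } else {
          serviceStatus = ServiceStatus.PAUSING;
        }
      } finally {
        serviceLock.unlock();
      }
    }
  }

  public static void main(String[] args) {
    // Leader is ready and SCM is out of safe mode for both triggers.
    FakeContext ctx = new FakeContext(true, false);

    Service reported = new Service(ctx);
    reported.notifyAsReported();  // 1st trigger
    reported.notifyAsReported();  // 2nd trigger
    System.out.println("as reported: " + reported.status());

    Service candidate = new Service(ctx);
    candidate.notifyIdempotent(); // 1st trigger
    candidate.notifyIdempotent(); // 2nd trigger
    System.out.println("idempotent:  " + candidate.status());
  }
}
```

In this model the reported logic ends in PAUSING after the second notification, while the idempotent variant stays in RUNNING and still pauses when the leader is not ready or SCM is in safe mode.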