[jira] [Updated] (HDDS-10799) SCMBlockDeletingService stuck in PAUSING state

Vyacheslav Tutrinov (Jira) Thu, 02 May 2024 23:53:47 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vyacheslav Tutrinov updated HDDS-10799:
---------------------------------------
    Description: 
SCM has a number of internal services (they implement the
org.apache.hadoop.hdds.scm.ha.SCMService interface). The interface has a method
for notifying the services about changes in Raft or in safe mode. While testing
the block deletion service, the following unexpected behavior was observed:

* transactions are flushed to the DB (i.e. snapshots were taken)
* containers are closed
* BUT transactions aren't sent to DNs, which leaves millions of unhandled
block deletion transactions

After investigating the problem, it appears that the SCM safe mode exit event
was triggered multiple times, and eventually the SCMBlockDeletingService was
moved to the PAUSING state:

{code:java|title=org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService#notifyStatusChanged}
  public void notifyStatusChanged() {
    serviceLock.lock();
    try {
      if (scmContext.isLeaderReady() && !scmContext.isInSafeMode() &&
          serviceStatus != ServiceStatus.RUNNING) {
        safemodeExitMillis = clock.millis();
        serviceStatus = ServiceStatus.RUNNING;
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
    } finally {
      serviceLock.unlock();
    }
  }
{code}


* *1st trigger*: SCM is LEADER, SCM is NOT in safe mode, the service is NOT in
RUNNING state -> the service transitions to RUNNING state
* *2nd trigger*: SCM is LEADER, SCM is NOT in safe mode, the service IS in
RUNNING state (as a result of the 1st trigger) -> the
serviceStatus != ServiceStatus.RUNNING check fails, so the else branch
transitions the service to PAUSING state

  was:
SCM has a number of internal services (they implement the
org.apache.hadoop.hdds.scm.ha.SCMService interface). The interface has a method
for notifying the services about changes in Raft or in safe mode. While testing
the block deletion service, the following unexpected behavior was observed:

* transactions are flushed to the DB (i.e. snapshots were taken)
* containers are closed
* BUT transactions aren't sent to DNs, which leaves millions of unhandled
block deletion transactions

After investigating the problem, it appears that the SCM safe mode exit event
was triggered multiple times, and eventually the SCMBlockDeletingService was
moved to the PAUSING state:

```java
# org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService#notifyStatusChanged
  public void notifyStatusChanged() {
    serviceLock.lock();
    try {
      if (scmContext.isLeaderReady() && !scmContext.isInSafeMode() &&
          serviceStatus != ServiceStatus.RUNNING) {
        safemodeExitMillis = clock.millis();
        serviceStatus = ServiceStatus.RUNNING;
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
    } finally {
      serviceLock.unlock();
    }
  }
```

* *1st trigger*: SCM is LEADER, SCM is NOT in safe mode, the service is NOT in
RUNNING state -> the service transitions to RUNNING state
* *2nd trigger*: SCM is LEADER, SCM is NOT in safe mode, the service IS in
RUNNING state (as a result of the 1st trigger) -> the service transitions to
PAUSING state


> SCMBlockDeletingService stuck in PAUSING state
> ----------------------------------------------
>
>                 Key: HDDS-10799
>                 URL: https://issues.apache.org/jira/browse/HDDS-10799
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>    Affects Versions: 1.4.0
>            Reporter: Vyacheslav Tutrinov
>            Assignee: Vyacheslav Tutrinov
>            Priority: Major
>
> SCM has a number of internal services (they implement the
> org.apache.hadoop.hdds.scm.ha.SCMService interface). The interface has a
> method for notifying the services about changes in Raft or in safe mode.
> While testing the block deletion service, the following unexpected behavior
> was observed:
> * transactions are flushed to the DB (i.e. snapshots were taken)
> * containers are closed
> * BUT transactions aren't sent to DNs, which leaves millions of unhandled
> block deletion transactions
> After investigating the problem, it appears that the SCM safe mode exit event
> was triggered multiple times, and eventually the SCMBlockDeletingService was
> moved to the PAUSING state:
> {code:java|title=org.apache.hadoop.hdds.scm.block.SCMBlockDeletingService#notifyStatusChanged}
>   public void notifyStatusChanged() {
>     serviceLock.lock();
>     try {
>       if (scmContext.isLeaderReady() && !scmContext.isInSafeMode() &&
>           serviceStatus != ServiceStatus.RUNNING) {
>         safemodeExitMillis = clock.millis();
>         serviceStatus = ServiceStatus.RUNNING;
>       } else {
>         serviceStatus = ServiceStatus.PAUSING;
>       }
>     } finally {
>       serviceLock.unlock();
>     }
>   }
> {code}
> * *1st trigger*: SCM is LEADER, SCM is NOT in safe mode, the service is NOT
> in RUNNING state -> the service transitions to RUNNING state
> * *2nd trigger*: SCM is LEADER, SCM is NOT in safe mode, the service IS in
> RUNNING state (as a result of the 1st trigger) -> the service transitions to
> PAUSING state
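
The two triggers described above can be reproduced outside of Ozone with a small standalone sketch. The class NotifySketch and its methods are hypothetical stand-ins, not the actual SCMBlockDeletingService: the leader and safe mode checks are passed in as plain booleans instead of being read from scmContext, and System.currentTimeMillis() replaces the injected clock. notifyAsReported mirrors the logic quoted in the issue; notifyIdempotent is only one possible way to make a repeated notification harmless and is not necessarily the fix adopted for HDDS-10799.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical, simplified stand-in for the service; not Ozone code. */
public class NotifySketch {
  public enum ServiceStatus { RUNNING, PAUSING }

  private final Lock serviceLock = new ReentrantLock();
  private ServiceStatus serviceStatus = ServiceStatus.PAUSING;
  private long safemodeExitMillis = 0;

  /** Same branching as the snippet in the issue: a repeat call flips RUNNING to PAUSING. */
  public ServiceStatus notifyAsReported(boolean leaderReady, boolean inSafeMode) {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode
          && serviceStatus != ServiceStatus.RUNNING) {
        safemodeExitMillis = System.currentTimeMillis();
        serviceStatus = ServiceStatus.RUNNING;
      } else {
        // Also reached when the service is already RUNNING.
        serviceStatus = ServiceStatus.PAUSING;
      }
      return serviceStatus;
    } finally {
      serviceLock.unlock();
    }
  }

  /** Assumed alternative: pause only when SCM is not a ready leader or is in safe mode. */
  public ServiceStatus notifyIdempotent(boolean leaderReady, boolean inSafeMode) {
    serviceLock.lock();
    try {
      if (leaderReady && !inSafeMode) {
        if (serviceStatus != ServiceStatus.RUNNING) {
          // Record the exit time only on the actual transition.
          safemodeExitMillis = System.currentTimeMillis();
          serviceStatus = ServiceStatus.RUNNING;
        }
      } else {
        serviceStatus = ServiceStatus.PAUSING;
      }
      return serviceStatus;
    } finally {
      serviceLock.unlock();
    }
  }

  public static void main(String[] args) {
    NotifySketch reported = new NotifySketch();
    System.out.println(reported.notifyAsReported(true, false));   // RUNNING
    System.out.println(reported.notifyAsReported(true, false));   // PAUSING

    NotifySketch idempotent = new NotifySketch();
    System.out.println(idempotent.notifyIdempotent(true, false)); // RUNNING
    System.out.println(idempotent.notifyIdempotent(true, false)); // RUNNING
  }
}
```

In the sketch, the state check is nested inside the leader and safe mode check, so a duplicate notification leaves a RUNNING service untouched and does not reset safemodeExitMillis.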



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

