Kohei Sugihara created HDDS-7925:
------------------------------------

             Summary: Potential deadlocks among all OMs in OM HA
                 Key: HDDS-7925
                 URL: https://issues.apache.org/jira/browse/HDDS-7925
             Project: Apache Ozone
          Issue Type: Bug
          Components: OM HA
    Affects Versions: 1.3.0
         Environment: Configuration: FSO enabled, OM HA, SCM HA
            Reporter: Kohei Sugihara


In our environment, we hit a timeout problem several times between December 
2022 and January 2023: all OMs stopped responding to OM RPCs such as listing 
keys, getting keys, and getting OM roles. In every case, rebooting the OMs in 
the appropriate order recovered the service, but the problem recurred within a 
few days each time. We examined logs and traces from all OMs and found that the 
IPC Server Handler threads in every OM were full of OM requests waiting on 
either OzoneManagerLock or Ratis, including OM Ratis. Since we switched the OMs 
to non-HA, they have run for two weeks and the problem has not recurred.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
