Kohei Sugihara created HDDS-7925:
------------------------------------
Summary: Potential deadlocks among all OMs in OM HA
Key: HDDS-7925
URL: https://issues.apache.org/jira/browse/HDDS-7925
Project: Apache Ozone
Issue Type: Bug
Components: OM HA
Affects Versions: 1.3.0
Environment: Configuration: FSO enabled, OM HA, SCM HA
Reporter: Kohei Sugihara
In our environment, we hit a timeout problem several times between December
2022 and January 2023: all OMs stopped responding to OM RPCs such as listing
keys, getting keys, and getting OM roles. In every case, rebooting the OMs in
the appropriate order recovered the service, but the problem recurred within a
few days each time. We dug through the logs and traces of all OMs and noticed
that the IPC Server Handlers in every OM were full of OM requests waiting for
either OzoneManagerLock or Ratis, including OM Ratis. After switching the OMs
to non-HA, we have run for two weeks without the problem recurring.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)