[
https://issues.apache.org/jira/browse/HDDS-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
István Fajth updated HDDS-9061:
-------------------------------
Description:
Current configuration set via safety valve for CA certificate rotation:
{code:java}
<property><name>hdds.x509.max.duration</name><value>P5D</value></property>
<property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
<property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
<property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
<property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
{code}
Based on the above configuration, the certificates should be renewed 10 minutes
after issuance, when the grace period comes into play.
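The renewal point implied by these values can be checked directly: a certificate becomes eligible for renewal once its lifetime minus the grace period has elapsed. A minimal sketch using java.time (illustrative only, not Ozone code):

```java
import java.time.Duration;

public class RenewalPoint {
    public static void main(String[] args) {
        // Values taken from the safety-valve configuration above.
        Duration lifetime = Duration.parse("PT1H");  // hdds.x509.default.duration
        Duration grace = Duration.parse("PT50M");    // hdds.x509.renew.grace.duration

        // The certificate enters its renewal grace period once
        // (lifetime - grace) has elapsed since issuance.
        Duration renewAfter = lifetime.minus(grace);
        System.out.println(renewAfter.toMinutes() + " minutes"); // prints "10 minutes"
    }
}
```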
Observation:
Certificates were renewed for only a few roles or instances, at random.
The primordial SCM was stopped and started at the following times:
2023-07-21 15:36:03,703 : STOPPED
2023-07-21 16:00:25,736 : STARTED
{code:java}
[root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
66603508550073199 Fri Jul 21 15:02:32 UTC 2023 Fri Jul 21 16:02:32 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509087676398 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509333666132 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509791941467 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510052147817 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510180218825 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510614053924 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603512588421508 Fri Jul 21 15:02:36 UTC 2023 Fri Jul 21 16:02:36 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604108035744075 Fri Jul 21 15:12:32 UTC 2023 Fri Jul 21 16:12:32 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109383514945 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109450607224 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109820303351 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604110216800118 Fri Jul 21 15:12:34 UTC 2023 Fri Jul 21 16:12:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
{code}
UPDATE with root cause:
Examining the thread dump of the leader SCM first, we saw these interesting
threads:
{code}
"main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable [0x00007f516b4e9000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPoll.wait([email protected]/Native Method)
    at sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:120)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:124)
    - locked <0x0000000730c03f58> (a sun.nio.ch.Util$2)
    - locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:136)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
    - locked <0x00000007058d3a20> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
    - locked <0x00000007058d3a20> (a org.apache.hadoop.ipc.Client$Connection)
    at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
    at org.apache.hadoop.ipc.Client.call(Client.java:1506)
    at org.apache.hadoop.ipc.Client.call(Client.java:1459)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
    at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke([email protected]/Method.java:566)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
    - locked <0x000000072d460c08> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
    at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
    at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
    at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959)
    - locked <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
    at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951)
    - locked <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
    at picocli.CommandLine.execute(CommandLine.java:2078)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
{code}
{code}
"IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms elapsed=206369.56s allocated=11121K defined_classes=216 tid=0x00007f5166457800 nid=0x2082b5 waiting for monitor entry [0x00007f511fba8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370)
    - waiting to lock <0x000000072a05b650> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
    at org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491)
    - locked <0x000000072cf80778> (a org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown Source)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
    at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
    at java.security.AccessController.doPrivileged([email protected]/Native Method)
    at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}
{code}
"IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800 nid=0x2082b6 waiting for monitor entry [0x00007f511faa7000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479)
    - waiting to lock <0x000000072cf80778> (a org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown Source)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
    at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
    at java.security.AccessController.doPrivileged([email protected]/Native Method)
    at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}
At first glance these threads do not trivially form a deadlock, but further
analysis of the environment shows that:
- The main thread has locked the SCMCertificateClient instance as part of the
listCA call; lock object ID: 0x000000072a05b650.
- Server handler #0 for port 9961 waits to lock the SCMCertificateClient in
getRootCACertificate, while it already holds the SCMSecurityProtocolServer lock
taken in setRootCAIfNeeded; lock object ID: 0x000000072cf80778.
- Server handler #1 for port 9961 waits to lock the same
SCMSecurityProtocolServer in setRootCAIfNeeded; lock object ID:
0x000000072cf80778.
With that, all handler threads of the SCMSecurityProtocolServer are depleted
(by default there are 2), so the main thread cannot submit its request. Since
the retry policy retries indefinitely, the main thread waits between retries;
that wait is the EPoll wait in which it sits in RUNNABLE state, while the other
two threads are BLOCKED.
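The three-way dependency above can be pictured as a wait-for graph in which the exhausted IPC handler pool itself acts as a resource. A simplified, hypothetical model (node names are shorthand for the real threads and locks, not Ozone code):

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class WaitForCycle {
    public static void main(String[] args) {
        // Edges reconstructed from the thread dump: "X waits for Y".
        Map<String, String> waitsFor = new HashMap<>();
        // main holds the SCMCertificateClient monitor and waits for a free
        // IPC handler to serve its listCACertificate RPC.
        waitsFor.put("main", "ipc-handler-pool");
        // handler #0 occupies a handler slot, holds the
        // SCMSecurityProtocolServer monitor, and waits for the
        // SCMCertificateClient monitor held by main.
        waitsFor.put("handler-0", "main");
        // handler #1 occupies the other slot and waits for the
        // SCMSecurityProtocolServer monitor held by handler #0.
        waitsFor.put("handler-1", "handler-0");
        // With both slots occupied, the pool effectively waits on handler #0.
        waitsFor.put("ipc-handler-pool", "handler-0");

        // Walk the graph from main; revisiting a node means a cycle, i.e. deadlock.
        Set<String> seen = new LinkedHashSet<>();
        String node = "main";
        while (node != null && seen.add(node)) {
            node = waitsFor.get(node);
        }
        System.out.println(node != null ? "deadlock cycle back to " + node : "no cycle");
        // prints "deadlock cycle back to main"
    }
}
```

Because the cycle runs through the exhausted handler pool rather than through monitors alone, it is consistent with the dump showing the main thread as RUNNABLE rather than BLOCKED, which is why the deadlock is not obvious at first sight.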
In our test environment this situation came up during a cluster restart that
followed the decommissioning of an SCM node and the addition of a new one. The
restart took some time, during which the service certificates entered the grace
period that marks their end of life, so the restarting services began renewing
their certificates while the leader SCM was starting up. With that, the leader
SCM ended up in the state detailed above. This effectively blocked the whole
startup: certificates could not be renewed, and the cluster became unusable.
was:
Current config set under safety valve for ca cert rotation :
{code:java}
<property><name>hdds.x509.max.duration</name><value>P5D</value></property>
<property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
<property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
<property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
<property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
{code}
based on above config the certificates should renew at 10th minutes from start
by the grace period coming into play.
Observation :
Certificates were renewed for a few roles or instances at random.
Primordial SCM was stopped and started at the below times :
2023-07-21 15:36:03,703 : STOPPED
2023-07-21 16:00:25,736 : STARTED
{code:java}
[root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
66603508550073199 Fri Jul 21 15:02:32 UTC 2023 Fri Jul 21 16:02:32 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509087676398 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509333666132 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509791941467 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510052147817 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510180218825 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510614053924 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603512588421508 Fri Jul 21 15:02:36 UTC 2023 Fri Jul 21 16:02:36 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604108035744075 Fri Jul 21 15:12:32 UTC 2023 Fri Jul 21 16:12:32 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109383514945 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109450607224 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109820303351 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604110216800118 Fri Jul 21 15:12:34 UTC 2023 Fri Jul 21 16:12:34 UTC 2023
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
{code}
UPDATE with Root cause:
After checking first, in the thread dump of the leader SCM, we saw these
threads that are interesting:
{code}
"main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M
defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable
[0x00007f516b4e9000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait([email protected]/Native Method)
at
sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:120)
at
sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:124)
- locked <0x0000000730c03f58> (a sun.nio.ch.Util$2)
- locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl)
at
sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:136)
at
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
at
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
- locked <0x00000007058d3a20> (a
org.apache.hadoop.ipc.Client$Connection)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
- locked <0x00000007058d3a20> (a
org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
at org.apache.hadoop.ipc.Client.call(Client.java:1506)
at org.apache.hadoop.ipc.Client.call(Client.java:1459)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke([email protected]/Method.java:566)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
- locked <0x000000072d460c08> (a
org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
at
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
at
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
at
org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959)
- locked <0x000000072a05b650> (a
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
at
org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951)
- locked <0x000000072a05b650> (a
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
at picocli.CommandLine.access$1300(CommandLine.java:145)
at
picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
at
picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
at picocli.CommandLine.execute(CommandLine.java:2078)
at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
{code}
{code}
"IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms
elapsed=206369.56s allocated=11121K defined_classes=216 tid=0x00007f5166457800
nid=0x2082b5 waiting for monitor entry [0x00007f511fba8000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370)
- waiting to lock <0x000000072a05b650> (a
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
at
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491)
- locked <0x000000072cf80778> (a
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
Source)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
at
org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
at java.security.AccessController.doPrivileged([email protected]/Native
Method)
at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}
{code}
"IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms
elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800
nid=0x2082b6 waiting for monitor entry [0x00007f511faa7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479)
- waiting to lock <0x000000072cf80778> (a
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
Source)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
at
org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
at java.security.AccessController.doPrivileged([email protected]/Native
Method)
at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}
At first glance these threads do not form an obvious dead-lock, but further
analysis of the environment shows that:
- the main thread holds the lock on the SCMCertificateClient instance as part
of the listCA call (lock object id: 0x000000072a05b650)
- server handler #0 for port 9961 is waiting to lock the SCMCertificateClient
in getRootCACertificate, while it already holds the lock on the
SCMSecurityProtocolServer acquired via setRootCAIfNeeded (lock object id:
0x000000072cf80778)
- server handler #1 for port 9961 is waiting to lock the
SCMSecurityProtocolServer in the same setRootCAIfNeeded (lock object id:
0x000000072cf80778)
With that, the handler threads of the SCMSecurityProtocolServer are depleted
(by default there are 2), so the main thread's request can never be served.
As the retry policy is infinite retry, the main thread waits for some time
between retries; that is the EPoll wait in which the main thread sits in
RUNNABLE state, while the other two threads are BLOCKED.
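The starvation pattern described above can be sketched in isolation: a caller holds a monitor and then submits a request to a pool whose every worker is already blocked on that same monitor, so the request can never run. This is a hypothetical minimal reproduction, not Ozone code; the class name and the lock standing in for SCMCertificateClient are illustrative only, and the pool size 2 mirrors the default handler count.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HandlerStarvationSketch {
  // Stands in for the SCMCertificateClient monitor from the thread dump.
  static final Object certClientLock = new Object();

  // Returns true when the extra request starves because both "handler"
  // threads are blocked on the lock the caller itself holds.
  static boolean reproduceStarvation() throws InterruptedException {
    ExecutorService handlers = Executors.newFixedThreadPool(2);
    boolean starved;
    synchronized (certClientLock) { // caller locks, like main in listCA
      for (int i = 0; i < 2; i++) {
        handlers.submit(() -> {
          // Handlers block here, like getRootCACertificate waiting on
          // the cert client monitor.
          synchronized (certClientLock) { }
        });
      }
      Thread.sleep(200); // let both handlers reach the blocked state
      // The caller's "RPC" needs a free handler, but none exists.
      Future<?> rpc = handlers.submit(() -> { });
      try {
        rpc.get(1, TimeUnit.SECONDS);
        starved = false;
      } catch (TimeoutException | ExecutionException e) {
        starved = true;
      }
    } // releasing the lock here lets the handlers finish
    handlers.shutdownNow();
    return starved;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(reproduceStarvation()
        ? "starved: no free handler while the caller holds the lock"
        : "completed");
  }
}
```

In the real cluster the caller cannot simply release the lock on a timeout: the infinite retry policy keeps the main thread retrying inside the locked section, which is why the state never resolves on its own.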
> Possible dead-lock in SCM initialization due to certificate operations.
> -----------------------------------------------------------------------
>
> Key: HDDS-9061
> URL: https://issues.apache.org/jira/browse/HDDS-9061
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM
> Reporter: Soumitra Sulav
> Assignee: Sammi Chen
> Priority: Critical
> Fix For: 1.4.0
>
> Attachments: quasar-kkgvze-1-datanode.log,
> quasar-kkgvze-1-ozone-scm.log, quasar-kkgvze-2-datanode.log,
> quasar-kkgvze-2-ozone-scm.log, quasar-kkgvze-6-ozone-scm.log
>
>
> Current config set under safety valve for ca cert rotation :
> {code:java}
> <property><name>hdds.x509.max.duration</name><value>P5D</value></property>
> <property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
> <property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
> <property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
> <property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
> <property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
> <property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
> {code}
> Based on the above config, the certificates should be renewed 10 minutes
> after start (default duration PT1H minus renew grace period PT50M), once the
> grace period comes into play.
> Observation:
> Certificates were renewed only for a few roles or instances, seemingly at
> random.
> Primordial SCM was stopped and started at the below times :
> 2023-07-21 15:36:03,703 : STOPPED
> 2023-07-21 16:00:25,736 : STARTED
> {code:java}
> [root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
> 66603508550073199 Fri Jul 21 15:02:32 UTC 2023 Fri Jul 21 16:02:32 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509087676398 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509333666132 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509791941467 Fri Jul 21 15:02:33 UTC 2023 Fri Jul 21 16:02:33 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510052147817 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510180218825 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510614053924 Fri Jul 21 15:02:34 UTC 2023 Fri Jul 21 16:02:34 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603512588421508 Fri Jul 21 15:02:36 UTC 2023 Fri Jul 21 16:02:36 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604108035744075 Fri Jul 21 15:12:32 UTC 2023 Fri Jul 21 16:12:32 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109383514945 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109450607224 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109820303351 Fri Jul 21 15:12:33 UTC 2023 Fri Jul 21 16:12:33 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604110216800118 Fri Jul 21 15:12:34 UTC 2023 Fri Jul 21 16:12:34 UTC 2023
>
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> {code}
> UPDATE with Root cause:
> After checking first, in the thread dump of the leader SCM, we saw these
> threads that are interesting:
> {code}
> "main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M
> defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable
> [0x00007f516b4e9000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.EPoll.wait([email protected]/Native Method)
> at
> sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:120)
> at
> sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:124)
> - locked <0x0000000730c03f58> (a sun.nio.ch.Util$2)
> - locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl)
> at
> sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:136)
> at
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
> at
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
> - locked <0x00000007058d3a20> (a
> org.apache.hadoop.ipc.Client$Connection)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
> - locked <0x00000007058d3a20> (a
> org.apache.hadoop.ipc.Client$Connection)
> at
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
> at org.apache.hadoop.ipc.Client.call(Client.java:1506)
> at org.apache.hadoop.ipc.Client.call(Client.java:1459)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
> at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown
> Source)
> at
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke([email protected]/Method.java:566)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
> - locked <0x000000072d460c08> (a
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
> at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
> at
> org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
> at
> org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
> at
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959)
> - locked <0x000000072a05b650> (a
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
> at
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951)
> - locked <0x000000072a05b650> (a
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
> at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
> at picocli.CommandLine.access$1300(CommandLine.java:145)
> at
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
> at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
> at
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
> at picocli.CommandLine.execute(CommandLine.java:2078)
> at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
> at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
> {code}
> {code}
> "IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms
> elapsed=206369.56s allocated=11121K defined_classes=216
> tid=0x00007f5166457800 nid=0x2082b5 waiting for monitor entry
> [0x00007f511fba8000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370)
> - waiting to lock <0x000000072a05b650> (a
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
> at
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491)
> - locked <0x000000072cf80778> (a
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
> Source)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
> at
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> at
> java.security.AccessController.doPrivileged([email protected]/Native Method)
> at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> {code}
> "IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms
> elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800
> nid=0x2082b6 waiting for monitor entry [0x00007f511faa7000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479)
> - waiting to lock <0x000000072cf80778> (a
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
> Source)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
> at
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
> at
> java.security.AccessController.doPrivileged([email protected]/Native Method)
> at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> At first glance these threads do not form an obvious dead-lock, but further
> analysis of the environment shows that:
> - the main thread holds the lock on the SCMCertificateClient instance as
> part of the listCA call (lock object id: 0x000000072a05b650)
> - server handler #0 for port 9961 is waiting to lock the
> SCMCertificateClient in getRootCACertificate, while it already holds the
> lock on the SCMSecurityProtocolServer acquired via setRootCAIfNeeded (lock
> object id: 0x000000072cf80778)
> - server handler #1 for port 9961 is waiting to lock the
> SCMSecurityProtocolServer in the same setRootCAIfNeeded (lock object id:
> 0x000000072cf80778)
> With that, the handler threads of the SCMSecurityProtocolServer are depleted
> (by default there are 2), so the main thread's request can never be served.
> As the retry policy is infinite retry, the main thread waits for some time
> between retries; that is the EPoll wait in which the main thread sits in
> RUNNABLE state, while the other two threads are BLOCKED.
> In our test environment this situation came up during a cluster restart that
> followed the decommission of an SCM node and the addition of a new SCM node.
> The restart took long enough for the service certificates to enter the grace
> period that marks their end of life, so the restarting services started to
> renew their certificates while the leader SCM was still starting up, driving
> the leader SCM into the state detailed above. This blocked the whole
> startup: certificates could not be renewed, and with that the cluster became
> unusable.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]