[ 
https://issues.apache.org/jira/browse/HDDS-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

István Fajth updated HDDS-9061:
-------------------------------
    Description: 
Current config set under safety valve for ca cert rotation :

{code:java}
<property><name>hdds.x509.max.duration</name><value>P5D</value></property>
<property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
<property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
<property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
<property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
{code}

based on above config the certificates should renew at 10th minutes from start 
by the grace period coming into play.

Observation :
Certificates were renewed for a few roles or instances at random.

Primordial SCM was stopped and started at the below times :
2023-07-21 15:36:03,703 : STOPPED
2023-07-21 16:00:25,736 : STARTED


{code:java}
[root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
66603508550073199 Fri Jul 21 15:02:32 UTC 2023   Fri Jul 21 16:02:32 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509087676398 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509333666132 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509791941467 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510052147817 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510180218825 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510614053924 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603512588421508 Fri Jul 21 15:02:36 UTC 2023   Fri Jul 21 16:02:36 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604108035744075 Fri Jul 21 15:12:32 UTC 2023   Fri Jul 21 16:12:32 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109383514945 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109450607224 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109820303351 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604110216800118 Fri Jul 21 15:12:34 UTC 2023   Fri Jul 21 16:12:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
{code}


UPDATE with Root cause:
After checking first, in the thread dump of the leader SCM, we saw these 
threads that are interesting:

{code}
"main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M 
defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable  
[0x00007f516b4e9000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPoll.wait([email protected]/Native Method)
        at 
sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:120)
        at 
sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:124)
        - locked <0x0000000730c03f58> (a sun.nio.ch.Util$2)
        - locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl)
        at 
sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:136)
        at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
        at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
        - locked <0x00000007058d3a20> (a 
org.apache.hadoop.ipc.Client$Connection)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
        - locked <0x00000007058d3a20> (a 
org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
        at org.apache.hadoop.ipc.Client.call(Client.java:1506)
        at org.apache.hadoop.ipc.Client.call(Client.java:1459)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
        at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
        at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke([email protected]/Method.java:566)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
        - locked <0x000000072d460c08> (a 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
        at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
        at 
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
        at 
org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
        at 
org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959)
        - locked <0x000000072a05b650> (a 
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
        at 
org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951)
        - locked <0x000000072a05b650> (a 
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
        at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557)
        at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515)
        at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172)
        at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
        at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
        at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
        at picocli.CommandLine.access$1300(CommandLine.java:145)
        at 
picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
        at 
picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
        at picocli.CommandLine.execute(CommandLine.java:2078)
        at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
        at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
        at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
{code}

{code}
"IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms 
elapsed=206369.56s allocated=11121K defined_classes=216 tid=0x00007f5166457800 
nid=0x2082b5 waiting for monitor entry  [0x00007f511fba8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370)
        - waiting to lock <0x000000072a05b650> (a 
org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
        at 
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491)
        - locked <0x000000072cf80778> (a 
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
 Source)
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
        at 
org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
        at java.security.AccessController.doPrivileged([email protected]/Native 
Method)
        at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}

{code}
"IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms 
elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800 
nid=0x2082b6 waiting for monitor entry  [0x00007f511faa7000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479)
        - waiting to lock <0x000000072cf80778> (a 
org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
 Source)
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at 
org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
        at 
org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
        at java.security.AccessController.doPrivileged([email protected]/Native 
Method)
        at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
{code}

Looking at it first these are not trivially causing a dead-lock, but further 
analysis of the environment shows that:

- main thread has locked on SCMCertificateClient instance as part of the listCA 
call. Lock object id: 0x000000072a05b650
- The server handler #0 for port 9961 waits on locking on SCMCertificateClient 
as part of the getRootCACertificate while it has already locked 
SCMSecurityProtocolServer in setRootCAIfNeeded, lock object ID is 
0x000000072cf80778
- The server handler #1 for port 9961 waits on locking on 
SCMSecurityProtocolServer in the same setRootCAIfNeeded, lock object ID is 
0x000000072cf80778

with that we have depleted the handler threads for the 
SCMSecurityProtocolServer (by default we have 2), so main thread can not submit 
the request, and as the retry policy is infinite retry, it waits for some time 
in between retries, and that is the EPoll wait on which the main thread is 
waiting in RUNNABLE state, while the other two threads are BLOCKED.

  was:
Current config set under safety valve for ca cert rotation :

{code:java}
<property><name>hdds.x509.max.duration</name><value>P5D</value></property>
<property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
<property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
<property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
<property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
<property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
{code}

based on above config the certificates should renew at 10th minutes from start 
by the grace period coming into play.

Observation :
Certificates were renewed for a few roles or instances at random.

Primordial SCM was stopped and started at the below times :
2023-07-21 15:36:03,703 : STOPPED
2023-07-21 16:00:25,736 : STARTED


{code:java}
[root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
66603508550073199 Fri Jul 21 15:02:32 UTC 2023   Fri Jul 21 16:02:32 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509087676398 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509333666132 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603509791941467 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510052147817 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510180218825 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603510614053924 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66603512588421508 Fri Jul 21 15:02:36 UTC 2023   Fri Jul 21 16:02:36 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604108035744075 Fri Jul 21 15:12:32 UTC 2023   Fri Jul 21 16:12:32 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109383514945 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109450607224 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604109820303351 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
66604110216800118 Fri Jul 21 15:12:34 UTC 2023   Fri Jul 21 16:12:34 UTC 2023   
[email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
 
CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
{code}



> Possible dead-lock in SCM initialization due to certificate operations.
> -----------------------------------------------------------------------
>
>                 Key: HDDS-9061
>                 URL: https://issues.apache.org/jira/browse/HDDS-9061
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Soumitra Sulav
>            Assignee: Sammi Chen
>            Priority: Critical
>             Fix For: 1.4.0
>
>         Attachments: quasar-kkgvze-1-datanode.log, 
> quasar-kkgvze-1-ozone-scm.log, quasar-kkgvze-2-datanode.log, 
> quasar-kkgvze-2-ozone-scm.log, quasar-kkgvze-6-ozone-scm.log
>
>
> Current config set under safety valve for ca cert rotation :
> {code:java}
> <property><name>hdds.x509.max.duration</name><value>P5D</value></property>
> <property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
> <property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
> <property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
> <property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
> <property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
> <property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
> {code}
> based on above config the certificates should renew at 10th minutes from 
> start by the grace period coming into play.
> Observation :
> Certificates were renewed for a few roles or instances at random.
> Primordial SCM was stopped and started at the below times :
> 2023-07-21 15:36:03,703 : STOPPED
> 2023-07-21 16:00:25,736 : STARTED
> {code:java}
> [root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
> 66603508550073199 Fri Jul 21 15:02:32 UTC 2023   Fri Jul 21 16:02:32 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509087676398 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509333666132 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509791941467 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510052147817 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510180218825 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510614053924 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603512588421508 Fri Jul 21 15:02:36 UTC 2023   Fri Jul 21 16:02:36 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604108035744075 Fri Jul 21 15:12:32 UTC 2023   Fri Jul 21 16:12:32 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109383514945 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109450607224 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109820303351 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604110216800118 Fri Jul 21 15:12:34 UTC 2023   Fri Jul 21 16:12:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> {code}
> UPDATE with Root cause:
> After checking first, in the thread dump of the leader SCM, we saw these 
> threads that are interesting:
> {code}
> "main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M 
> defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable  
> [0x00007f516b4e9000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPoll.wait([email protected]/Native Method)
>         at 
> sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:120)
>         at 
> sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:124)
>         - locked <0x0000000730c03f58> (a sun.nio.ch.Util$2)
>         - locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl)
>         at 
> sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:136)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
>         - locked <0x00000007058d3a20> (a 
> org.apache.hadoop.ipc.Client$Connection)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
>         - locked <0x00000007058d3a20> (a 
> org.apache.hadoop.ipc.Client$Connection)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1506)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1459)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
>         at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown 
> Source)
>         at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke([email protected]/Method.java:566)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>         - locked <0x000000072d460c08> (a 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>         at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
>         at 
> org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
>         at 
> org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
>         at 
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959)
>         - locked <0x000000072a05b650> (a 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
>         at 
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951)
>         - locked <0x000000072a05b650> (a 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>         at picocli.CommandLine.access$1300(CommandLine.java:145)
>         at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
>         at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>         at picocli.CommandLine.execute(CommandLine.java:2078)
>         at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
>         at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
> {code}
> {code}
> "IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms 
> elapsed=206369.56s allocated=11121K defined_classes=216 
> tid=0x00007f5166457800 nid=0x2082b5 waiting for monitor entry  
> [0x00007f511fba8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370)
>         - waiting to lock <0x000000072a05b650> (a 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
>         at 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491)
>         - locked <0x000000072cf80778> (a 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
>  Source)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
>         at 
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at 
> java.security.AccessController.doPrivileged([email protected]/Native Method)
>         at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> {code}
> "IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms 
> elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800 
> nid=0x2082b6 waiting for monitor entry  [0x00007f511faa7000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479)
>         - waiting to lock <0x000000072cf80778> (a 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
>  Source)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
>         at 
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at 
> java.security.AccessController.doPrivileged([email protected]/Native Method)
>         at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> Looking at it first these are not trivially causing a dead-lock, but further 
> analysis of the environment shows that:
> - main thread has locked on SCMCertificateClient instance as part of the 
> listCA call. Lock object id: 0x000000072a05b650
> - The server handler #0 for port 9961 waits on locking on 
> SCMCertificateClient as part of the getRootCACertificate while it has already 
> locked SCMSecurityProtocolServer in setRootCAIfNeeded, lock object ID is 
> 0x000000072cf80778
> - The server handler #1 for port 9961 waits on locking on 
> SCMSecurityProtocolServer in the same setRootCAIfNeeded, lock object ID is 
> 0x000000072cf80778
> with that we have depleted the handler threads for the 
> SCMSecurityProtocolServer (by default we have 2), so main thread can not 
> submit the request, and as the retry policy is infinite retry, it waits for 
> some time in between retries, and that is the EPoll wait on which the main 
> thread is waiting in RUNNABLE state, while the other two threads are BLOCKED.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


Reply via email to