[ 
https://issues.apache.org/jira/browse/HDDS-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen updated HDDS-9061:
-----------------------------
    Attachment: quasar-kkgvze-2-scmjstack.log

> Possible dead-lock in SCM initialization due to certificate operations.
> -----------------------------------------------------------------------
>
>                 Key: HDDS-9061
>                 URL: https://issues.apache.org/jira/browse/HDDS-9061
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Soumitra Sulav
>            Assignee: István Fajth
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.4.0
>
>         Attachments: quasar-kkgvze-1-datanode.log, 
> quasar-kkgvze-1-ozone-scm.log, quasar-kkgvze-2-datanode.log, 
> quasar-kkgvze-2-ozone-scm.log, quasar-kkgvze-2-scmjstack.log, 
> quasar-kkgvze-6-ozone-scm.log
>
>
> Current config set under safety valve for ca cert rotation :
> {code:java}
> <property><name>hdds.x509.max.duration</name><value>P5D</value></property>
> <property><name>hdds.x509.default.duration</name><value>PT1H</value></property>
> <property><name>hdds.x509.renew.grace.duration</name><value>PT50M</value></property>
> <property><name>hdds.block.token.expiry.time</name><value>15m</value></property>
> <property><name>ozone.manager.delegation.token.renew-interval</name><value>15m</value></property>
> <property><name>ozone.manager.delegation.token.max-lifetime</name><value>30m</value></property>
> <property><name>hdds.x509.ca.rotation.check.interval</name><value>PT30M</value></property>
> {code}
> based on above config the certificates should renew at 10th minutes from 
> start by the grace period coming into play.
> Observation :
> Certificates were renewed for a few roles or instances at random.
> Primordial SCM was stopped and started at the below times :
> 2023-07-21 15:36:03,703 : STOPPED
> 2023-07-21 16:00:25,736 : STARTED
> {code:java}
> [root@quasar-kkgvze-1 ~]# ozone admin cert list | grep "dn@"
> 66603508550073199 Fri Jul 21 15:02:32 UTC 2023   Fri Jul 21 16:02:32 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509087676398 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509333666132 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603509791941467 Fri Jul 21 15:02:33 UTC 2023   Fri Jul 21 16:02:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510052147817 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510180218825 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603510614053924 Fri Jul 21 15:02:34 UTC 2023   Fri Jul 21 16:02:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66603512588421508 Fri Jul 21 15:02:36 UTC 2023   Fri Jul 21 16:02:36 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604108035744075 Fri Jul 21 15:12:32 UTC 2023   Fri Jul 21 16:12:32 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109383514945 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109450607224 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604109820303351 Fri Jul 21 15:12:33 UTC 2023   Fri Jul 21 16:12:33 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> 66604110216800118 Fri Jul 21 15:12:34 UTC 2023   Fri Jul 21 16:12:34 UTC 2023 
>   
> [email protected],OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
>  
> CN=scm-sub-66603487342057...@quasar-kkgvze-1.quasar-kkgvze.root.hwx.site,OU=907e103a-3636-489a-b9fe-c23f697e1526,O=CID-e8bd6d19-a4ca-49af-8a6b-45c5cee57509
> {code}
> UPDATE with Root cause:
> After checking first, in the thread dump of the leader SCM, we saw these 
> threads that are interesting:
> {code}
> "main" #1 prio=5 os_prio=0 cpu=31356.46ms elapsed=206374.58s allocated=515M 
> defined_classes=8384 tid=0x00007f5164015800 nid=0x208099 runnable  
> [0x00007f516b4e9000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPoll.wait([email protected]/Native Method)
>         at 
> sun.nio.ch.EPollSelectorImpl.doSelect([email protected]/EPollSelectorImpl.java:120)
>         at 
> sun.nio.ch.SelectorImpl.lockAndDoSelect([email protected]/SelectorImpl.java:124)
>         - locked <0x0000000730c03f58> (a sun.nio.ch.Util$2)
>         - locked <0x000000072afbf058> (a sun.nio.ch.EPollSelectorImpl)
>         at 
> sun.nio.ch.SelectorImpl.select([email protected]/SelectorImpl.java:136)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:202)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
>         - locked <0x00000007058d3a20> (a 
> org.apache.hadoop.ipc.Client$Connection)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
>         - locked <0x00000007058d3a20> (a 
> org.apache.hadoop.ipc.Client$Connection)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1506)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1459)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
>         at jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown 
> Source)
>         at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke([email protected]/DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke([email protected]/Method.java:566)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>         - locked <0x000000072d460c08> (a 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>         at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
>         at 
> org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
>         at 
> org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
>         at 
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:959)
>         - locked <0x000000072a05b650> (a 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
>         at 
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:951)
>         - locked <0x000000072a05b650> (a 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.persistSCMCertificates(StorageContainerManager.java:1557)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.start(StorageContainerManager.java:1515)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:172)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
>         at picocli.CommandLine.access$1300(CommandLine.java:145)
>         at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
>         at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
>         at picocli.CommandLine.execute(CommandLine.java:2078)
>         at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
>         at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
>         at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
> {code}
> {code}
> "IPC Server handler 0 on 9961" #263 daemon prio=5 os_prio=0 cpu=3568.63ms 
> elapsed=206369.56s allocated=11121K defined_classes=216 
> tid=0x00007f5166457800 nid=0x2082b5 waiting for monitor entry  
> [0x00007f511fba8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getCACertificate(DefaultCertificateClient.java:370)
>         - waiting to lock <0x000000072a05b650> (a 
> org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
>         at 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:491)
>         - locked <0x000000072cf80778> (a 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
>  Source)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
>         at 
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at 
> java.security.AccessController.doPrivileged([email protected]/Native Method)
>         at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> {code}
> "IPC Server handler 1 on 9961" #264 daemon prio=5 os_prio=0 cpu=3501.59ms 
> elapsed=206369.56s allocated=1728K defined_classes=3 tid=0x00007f5166459800 
> nid=0x2082b6 waiting for monitor entry  [0x00007f511faa7000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer.getRootCACertificate(SCMSecurityProtocolServer.java:479)
>         - waiting to lock <0x000000072cf80778> (a 
> org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.setRootCAIfNeeded(SCMSecurityProtocolServerSideTranslatorPB.java:413)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getOMCertificate(SCMSecurityProtocolServerSideTranslatorPB.java:286)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:112)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB$$Lambda$795/0x00000008007eb440.apply(Unknown
>  Source)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:93)
>         at 
> org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:17451)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:994)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:922)
>         at 
> java.security.AccessController.doPrivileged([email protected]/Native Method)
>         at javax.security.auth.Subject.doAs([email protected]/Subject.java:423)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2899)
> {code}
> Looking at it first these are not trivially causing a dead-lock, but further 
> analysis of the environment shows that:
> - main thread has locked on SCMCertificateClient instance as part of the 
> listCA call. Lock object id: 0x000000072a05b650
> - The server handler #0 for port 9961 waits on locking on 
> SCMCertificateClient as part of the getRootCACertificate while it has already 
> locked SCMSecurityProtocolServer in setRootCAIfNeeded, lock object ID is 
> 0x000000072cf80778
> - The server handler #1 for port 9961 waits on locking on 
> SCMSecurityProtocolServer in the same setRootCAIfNeeded, lock object ID is 
> 0x000000072cf80778
> with that we have depleted the handler threads for the 
> SCMSecurityProtocolServer (by default we have 2), so main thread can not 
> submit the request, and as the retry policy is infinite retry, it waits for 
> some time in between retries, and that is the EPoll wait on which the main 
> thread is waiting in RUNNABLE state, while the other two threads are BLOCKED.
> This situation in our test environment came up during a cluster restart after 
> some SCM node decommission then a new SCM node added, and this restart was 
> taking time, during which service certificates entered the grace period that 
> marks their end of life, and restarting services started to renew their 
> certificates while the leader SCM started up, so with that within the leader 
> SCM, we got into the detailed state. This pretty much blocked the whole 
> startup, certificates could not be renewed, and with that the cluster became 
> unusable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to