ChenSammi opened a new pull request, #5561: URL: https://github.com/apache/ozone/pull/5561
## What changes were proposed in this pull request?

Resolve the backward compatibility issue introduced in HDDS-8588. The root cause is that the listCA() call during SCM startup tries to call SCM's own SCMSecurityProtocolServer API, but the SCMSecurityProtocolServer is not ready at that point. The call has a max retry policy, so SCM gets stuck retrying and cannot start up. The fix avoids the remote API call and uses the local on-disk information to build the TrustChain.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9420

## How was this patch tested?

Tested manually with the following steps:

1. Enable Ozone security: ozone.security.enabled
2. Enable gRPC security: hdds.grpc.tls.enabled
3. Install a 1.3.0 OM cluster with the above properties, run "scm --init", start SCM, and then stop SCM.
4. Upgrade the cluster to the master branch and start SCM. SCM hangs with the following stack. Stop SCM.

   ```
   "main" #1 prio=5 os_prio=31 tid=0x0000000142009000 nid=0x2203 waiting on condition [0x000000016bf51000]
      java.lang.Thread.State: TIMED_WAITING (sleeping)
       at java.lang.Thread.sleep(Native Method)
       at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:131)
       at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:108)
       - locked <0x00000005c48670c8> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
       at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
       at com.sun.proxy.$Proxy11.submitRequest(Unknown Source)
       at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
       at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
       at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:952)
       at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:940)
       at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getTrustChain(DefaultCertificateClient.java:420)
       - locked <0x00000005c107c2d8> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
       at org.apache.hadoop.hdds.security.ssl.ReloadingX509KeyManager.loadKeyManager(ReloadingX509KeyManager.java:204)
       at org.apache.hadoop.hdds.security.ssl.ReloadingX509KeyManager.<init>(ReloadingX509KeyManager.java:85)
       at org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory.createKeyManagers(PemFileBasedKeyStoresFactory.java:83)
       at org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory.init(PemFileBasedKeyStoresFactory.java:104)
       - locked <0x00000005c4698000> (a org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory)
       at org.apache.hadoop.hdds.security.x509.keys.SecurityUtil.getServerKeyStoresFactory(SecurityUtil.java:103)
       at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getServerKeyStoresFactory(DefaultCertificateClient.java:967)
       - locked <0x00000005c107c2d8> (a org.apache.hadoop.hdds.security.x509.certificate.client.SCMCertificateClient)
       at org.apache.hadoop.hdds.scm.ha.HASecurityUtils.createSCMRatisTLSConfig(HASecurityUtils.java:341)
       at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.<init>(SCMRatisServerImpl.java:109)
       at org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.<init>(SCMHAManagerImpl.java:97)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManager.initializeSystemManagers(StorageContainerManager.java:650)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:403)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:601)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:613)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:171)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
       at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
       at picocli.CommandLine.access$1300(CommandLine.java:145)
       at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
       at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
       at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
       at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2179)
       at picocli.CommandLine.execute(CommandLine.java:2078)
       at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:100)
       at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:91)
       at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.main(StorageContainerManagerStarter.java:63)
   ```

5. Upgrade the cluster to master with this patch and start SCM; it starts successfully. The message "Key manager is loaded with certificate chain" is found in the SCM log.
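
For readers who want the idea of the fix in isolation, here is a minimal sketch of "build the trust chain from local on-disk info instead of a remote call". It is not the actual change in this PR and not the real `DefaultCertificateClient` code: the class `LocalFirstTrustChain`, the `*.crt` file naming, the use of raw PEM strings instead of parsed certificates, and the remote fallback for an empty directory are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration only; not the Ozone implementation. */
class LocalFirstTrustChain {

  /** Reads every *.crt file in certDir as PEM text, ordered by path. */
  static List<String> loadLocalCaCerts(Path certDir) throws IOException {
    List<String> pems = new ArrayList<>();
    if (!Files.isDirectory(certDir)) {
      return pems; // nothing on disk yet
    }
    try (Stream<Path> files = Files.list(certDir)) {
      List<Path> sorted = files
          .filter(p -> p.getFileName().toString().endsWith(".crt"))
          .sorted()
          .collect(Collectors.toList());
      for (Path p : sorted) {
        pems.add(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
      }
    }
    return pems;
  }

  /**
   * Returns the local certificates when any exist, so the remote
   * supplier (the call that may block while its server is not yet
   * up) is never invoked. The fallback to remoteListCa for an empty
   * directory is an assumption of this sketch.
   */
  static List<String> buildTrustChain(Path certDir,
      Supplier<List<String>> remoteListCa) throws IOException {
    List<String> local = loadLocalCaCerts(certDir);
    if (!local.isEmpty()) {
      return local;
    }
    return remoteListCa.get();
  }
}
```

The point of the design is that a process which already holds its CA certificates on disk never depends on its own RPC server being up during startup, which is the situation shown in the stack above.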
