[ 
https://issues.apache.org/jira/browse/HDDS-9420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773944#comment-17773944
 ] 

Sammi Chen commented on HDDS-9420:
----------------------------------

I believe the root cause of this issue is the same as HDDS-9410.  I discussed 
this issue with [~sgal] yesterday.  In DefaultCertificateClient#getTrustChain, 
when a system is upgraded to a new version that supports certificate bundles, 
the certificates already on disk are not bundled, so the code takes the "else" 
branch. That branch calls listCA(), which tries to connect to the SCM security 
server over the network while the SCM security server has not started yet.

 
{code:java}
List<X509Certificate> chain = new ArrayList<>();
// certificate bundle case
if (path.getCertificates().size() > 1) {
  for (int i = 0; i < path.getCertificates().size(); i++) {
    chain.add((X509Certificate) path.getCertificates().get(i));
  }
} else {
  // case before certificate bundle is supported
  X509Certificate lastInsertedCert = getCertificate();
  chain.add(lastInsertedCert);
  List<X509Certificate> caCertList =
      OzoneSecurityUtil.convertToX509(listCA());
  Set<X509Certificate> rootCaCertList = getAllRootCaCerts();
  while (!rootCaCertList.isEmpty() &&
      !rootCaCertList.contains(lastInsertedCert)) {
    Optional<X509Certificate> issuerOpt =
        getIssuerForCert(lastInsertedCert, caCertList);
    if (issuerOpt.isPresent()) {
      X509Certificate issuer = issuerOpt.get();
      chain.add(issuer);
      lastInsertedCert = issuer;
    } else {
      throw new CertificateException("No issuer found for certificate: " +
          lastInsertedCert);
    }
  }
  //add root ca to the cert chain at the end
  chain.add(lastInsertedCert);
} {code}
cc [~sgal]  [~pifta] 

Here is the SCM cert directory from before certificate bundles were supported:
{code:java}
[root@ozn-noha1-5 systest]# tree /var/lib/hadoop-ozone/scm/ozone-metadata/scm/sub-ca/
/var/lib/hadoop-ozone/scm/ozone-metadata/scm/sub-ca/
|-- certs
|   |-- 6039223955717647.crt
|   |-- CA-1.crt
|   `-- certificate.crt
`-- keys
    |-- private.pem
    `-- public.pem {code}

I think there are two possible solutions to address this issue.

1. Support an upgrade action that merges all certificate files under the 
"sub-ca" directory into one file, turning the non-bundle case into the 
certificate bundle case. This would essentially allow the "else" branch of 
DefaultCertificateClient#getTrustChain to be removed.

2. Revert the "else" branch to its previous logic. It should work too. 
{code:java}
// certificate bundle case
if (path.getCertificates().size() > 1) {
  for (int i = 0; i < path.getCertificates().size(); i++) {
    chain.add((X509Certificate) path.getCertificates().get(i));
  }
} else {
  // case before certificate bundle is supported
  chain.add(getCertificate());
  X509Certificate cert = getCACertificate();
  if (cert != null) {
    chain.add(getCACertificate());
  }
  cert = getRootCACertificate();
  if (cert != null) {
    chain.add(cert);
  }
} {code}
cc [~pifta]
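
For the first option, the merge step could be sketched roughly as below. This is only an illustration under stated assumptions, not existing Ozone code: the class name CertBundleMerger, the merge method, and the idea that the caller passes the certificate file names in chain order (leaf first, root last) are all hypothetical. The sketch only concatenates PEM text; it does not verify that the certificates actually form a valid chain.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper, not part of Ozone: merges separate PEM certificate
// files into a single bundle file.
final class CertBundleMerger {
  private static final String PEM_HEADER = "-----BEGIN CERTIFICATE-----";

  private CertBundleMerger() { }

  /**
   * Concatenates the named PEM files from certDir, in the order given by the
   * caller, into bundleFile. The order is assumed to be leaf first, root last.
   */
  static Path merge(Path certDir, List<String> orderedFileNames,
      Path bundleFile) throws IOException {
    StringBuilder bundle = new StringBuilder();
    for (String name : orderedFileNames) {
      byte[] raw = Files.readAllBytes(certDir.resolve(name));
      String pem = new String(raw, StandardCharsets.US_ASCII).trim();
      // Reject anything that does not look like a PEM certificate.
      if (!pem.startsWith(PEM_HEADER)) {
        throw new IOException("Not a PEM certificate: " + name);
      }
      bundle.append(pem).append('\n');
    }
    // Write to a temporary file first, then move it into place, so that a
    // failure part-way through does not leave a truncated bundle behind.
    Path tmp = Files.createTempFile(
        bundleFile.toAbsolutePath().getParent(), "bundle", ".tmp");
    Files.write(tmp, bundle.toString().getBytes(StandardCharsets.US_ASCII));
    return Files.move(tmp, bundleFile, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

An upgrade action could call something like CertBundleMerger.merge(certsDir, names, certsDir.resolve("bundle.crt")), where "bundle.crt" is a made-up file name. How the real file names map to leaf, sub-CA and root certificates would have to come from the existing certificate storage logic.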

> Enabling GRPC encryption causes SCM startup failure.  
> ------------------------------------------------------
>
>                 Key: HDDS-9420
>                 URL: https://issues.apache.org/jira/browse/HDDS-9420
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Sadanand Shenoy
>            Assignee: István Fajth
>            Priority: Major
>
> HDDS-8178 added a feature to support multiple sub-CA certs in the trust 
> chain. In the SCM constructor, if security is enabled and 
> hdds.grpc.tls.enabled is true, it tries to load the keyStoresFactory:
> {code:java}
> if (conf.isSecurityEnabled() && conf.isGrpcTlsEnabled()) {
>   KeyStoresFactory serverKeyFactory =
>       certificateClient.getServerKeyStoresFactory(); {code}
> This in turn calls loadKeyManager, which tries to load the entire trust chain:
> {code:java}
> private X509ExtendedKeyManager loadKeyManager(CertificateClient caClient)
>     throws GeneralSecurityException, IOException {
>   PrivateKey privateKey = caClient.getPrivateKey();
>   List<X509Certificate> newCertList = caClient.getTrustChain(); {code}
> Loading the entire trust chain makes a listCA call, which is a network call 
> to SCMSecurityProtocolServer:
> {code:java}
> public List<String> updateCAList() throws IOException {
>   pemEncodedCACertsLock.lock();
>   try {
>     pemEncodedCACerts = getScmSecureClient().listCACertificate(); {code}
> All of this happens inside the StorageContainerManager constructor, but the 
> services in SCM are started only after the constructor has finished and 
> scm.start() is called. The request is therefore sent to the security server 
> before it has even started, which leads to connection refused messages 
> during SCM startup like the ones below:
> {code:java}
> 10:45:45.506 AM             INFO      SCMRatisServerImpl starting Raft server for scm:7b4b7153-eb02-443b-b8f9-3b146931674c
> 10:45:47.563 AM             INFO      RetryInvocationHandler com.google.protobuf.ServiceException: java.net.ConnectException: Call From <HOSTNAME>/<IP> to <HOSTNAME>:9961 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy11.submitRequest over nodeId=node1,nodeAddress=<HOSTNAME>/<IP>:9961 after 1 failover attempts. Trying to failover after sleeping for 2000ms.
> 10:45:49.565 AM             INFO      RetryInvocationHandler com.google.protobuf.ServiceException: java.net.ConnectException: Call From <HOSTNAME>/<IP> to <HOSTNAME>:9961 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy11.submitRequest over nodeId=node1,nodeAddress=<HOSTNAME>/<IP>:9961 after 2 failover attempts. Trying to failover after sleeping for 2000ms.
> (repeated) {code}
> StackTrace
> {code:java}
> java.net.ConnectException: Connection refused
>     at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:205)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:586)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:730)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:843)
>     at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:430)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1681)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1506)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1459)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>     at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
>     at jdk.internal.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>     at com.sun.proxy.$Proxy14.submitRequest(Unknown Source)
>     at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.submitRequest(SCMSecurityProtocolClientSideTranslatorPB.java:102)
>     at org.apache.hadoop.hdds.protocolPB.SCMSecurityProtocolClientSideTranslatorPB.listCACertificate(SCMSecurityProtocolClientSideTranslatorPB.java:374)
>     at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.updateCAList(DefaultCertificateClient.java:933)
>     at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.listCA(DefaultCertificateClient.java:921)
>     at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getTrustChain(DefaultCertificateClient.java:410)
>     at org.apache.hadoop.hdds.security.ssl.ReloadingX509KeyManager.loadKeyManager(ReloadingX509KeyManager.java:204)
>     at org.apache.hadoop.hdds.security.ssl.ReloadingX509KeyManager.<init>(ReloadingX509KeyManager.java:85)
>     at org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory.createKeyManagers(PemFileBasedKeyStoresFactory.java:83)
>     at org.apache.hadoop.hdds.security.ssl.PemFileBasedKeyStoresFactory.init(PemFileBasedKeyStoresFactory.java:104)
>     at org.apache.hadoop.hdds.security.x509.keys.SecurityUtil.getServerKeyStoresFactory(SecurityUtil.java:103)
>     at org.apache.hadoop.hdds.security.x509.certificate.client.DefaultCertificateClient.getServerKeyStoresFactory(DefaultCertificateClient.java:948)
>     at org.apache.hadoop.hdds.scm.ha.HASecurityUtils.createSCMRatisTLSConfig(HASecurityUtils.java:345)
>     at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.<init>(SCMRatisServerImpl.java:109)
>     at org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.<init>(SCMHAManagerImpl.java:97)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManager.initializeSystemManagers(StorageContainerManager.java:646)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManager.<init>(StorageContainerManager.java:400)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:597)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManager.createSCM(StorageContainerManager.java:609)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.start(StorageContainerManagerStarter.java:171)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.startScm(StorageContainerManagerStarter.java:145)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:74)
>     at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter.call(StorageContainerManagerStarter.java:48)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:1953) {code}


