[ https://issues.apache.org/jira/browse/HDDS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

István Fajth updated HDDS-5078:
-------------------------------
    Description: 
On a cluster managed by Cloudera Manager, the SCM is always started with the 
--init option specified, and this behaviour revealed the following null pointer 
dereference:
StorageContainerManager#initializeCertificateClient initializes the 
scmCertificateClient only if scmStorageConfig#checkPrimarySCMIdInitialized() 
evaluates to true, which happens only if the VERSION file contains 
primaryScmNodeId with a value.

If you upgrade an existing cluster with a single SCM to this code, the VERSION 
file does not contain a primaryScmNodeId, so the scmCertificateClient remains 
null.
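
The condition above can be illustrated with a small sketch. It is not the Ozone 
implementation: it assumes the VERSION file can be read as a Java properties 
file, and the class name, method name and sample file contents are hypothetical. 
Only the primaryScmNodeId key comes from the description.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical stand-in for the check described above; not the Ozone code.
final class VersionFileCheck {

  // Property key named in the description of the VERSION file.
  static final String PRIMARY_SCM_NODE_ID = "primaryScmNodeId";

  // Returns true only if the content has primaryScmNodeId with a non-empty value.
  static boolean isPrimaryScmIdInitialized(String versionFileContent) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(versionFileContent));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String value = props.getProperty(PRIMARY_SCM_NODE_ID);
    return value != null && !value.trim().isEmpty();
  }

  public static void main(String[] args) {
    // Made-up VERSION content from before the upgrade: the key is absent.
    String beforeUpgrade = "nodeType=SCM\nclusterID=CID-example\n";
    // Made-up VERSION content with the key set.
    String withPrimaryId = beforeUpgrade + "primaryScmNodeId=scm-example-1\n";

    System.out.println(isPrimaryScmIdInitialized(beforeUpgrade));
    System.out.println(isPrimaryScmIdInitialized(withPrimaryId));
  }
}
```

With a check of this shape, a VERSION file written before the upgrade always 
takes the "not initialized" branch, which matches the behaviour reported here.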

Later, the initialization code calls the 
StorageContainerManager#initializeCAnSecurityProtocol method, which at the end 
creates the securityProtocolServer. For the constructor call, the rootCACert is 
provided by calling the scmCertificateClient#getCACertificate method, but this 
is a null dereference because scmCertificateClient is null.

The scmCertificateClient being null can cause problems later as well, as it is 
used multiple times unconditionally.
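
The failure mode described above can be modelled in a few lines. This is a 
heavily simplified sketch: ScmInitSketch and CertClient are hypothetical 
stand-ins, and the method bodies only mirror the reported control flow, not the 
real SCM code. The "checked" variant is one possible way to fail fast with a 
descriptive message, shown as an assumption rather than as the fix for this 
issue.

```java
import java.util.Objects;

// Hypothetical, simplified model of the reported initialization flow.
final class ScmInitSketch {

  // Stand-in for the certificate client; only the accessor used here.
  static final class CertClient {
    String getCACertificate() {
      return "root-ca-cert-placeholder";
    }
  }

  // Stays null unless initializeCertificateClient creates it.
  private CertClient scmCertificateClient;

  // Mirrors the reported condition: the client is created only when the
  // primary SCM id is initialized.
  void initializeCertificateClient(boolean primaryScmIdInitialized) {
    if (primaryScmIdInitialized) {
      scmCertificateClient = new CertClient();
    }
  }

  // Mirrors the unconditional use: throws NullPointerException when the
  // client was never created.
  String initializeCAnSecurityProtocol() {
    return scmCertificateClient.getCACertificate();
  }

  // Fail-fast variant: still a NullPointerException, but with a message
  // that names the field that was left uninitialized.
  String initializeCAnSecurityProtocolChecked() {
    Objects.requireNonNull(scmCertificateClient,
        "scmCertificateClient was not initialized");
    return scmCertificateClient.getCACertificate();
  }

  public static void main(String[] args) {
    ScmInitSketch upgraded = new ScmInitSketch();
    // Upgraded cluster: VERSION file has no primaryScmNodeId.
    upgraded.initializeCertificateClient(false);
    try {
      upgraded.initializeCAnSecurityProtocol();
    } catch (NullPointerException e) {
      System.out.println("unguarded call failed with NullPointerException");
    }
    try {
      upgraded.initializeCAnSecurityProtocolChecked();
    } catch (NullPointerException e) {
      System.out.println("guarded call failed: " + e.getMessage());
    }
  }
}
```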

Later on, after working around this particular problem (by simply letting the 
code create the scmCertificateClient unconditionally), it turned out that in 
the StorageContainerManager#initializeCAnSecurityProtocol call the 
scmCertificateServer and the rootCertificateServer instances also remain 
uninitialized, which causes problems when an SCM client tries to get the 
root CA certificate from the SCM.
To me this suggests that initialization of the SCM fails after an upgrade on 
an old cluster. This was working fine before: --init did not reinitialize 
anything, but it worked fine.

If I change the Cloudera Manager behaviour so that it does not init the SCM 
when I start it, I still get the same NPE from the SCM as with --init.
The exception I get in the SCM log is as follows; the command I issue is a 
recommission of a DN that was decommissioned before the upgrade.
{code}
java.lang.NullPointerException
        at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMGetCertResponseProto$Builder.setX509RootCACertificate(SCMSecurityProtocolProtos.java:9026)
        at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getCACertificate(SCMSecurityProtocolServerSideTranslatorPB.java:257)
        at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:104)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:89)
        at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:10537)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:986)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:914)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2887)
{code}
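
The top frame of the trace is a setter on a generated protobuf builder. Setters 
generated for Java reject a null argument with a NullPointerException, so the 
trace is consistent with the server passing a null root CA certificate into the 
response. The sketch below uses a hand-written stand-in for the builder (no 
protobuf dependency). Apart from setX509RootCACertificate and getCACertificate, 
which appear in the trace, every name is hypothetical, and the guarded variant 
is only one possible way to surface the missing certificate more clearly.

```java
// Hand-written stand-in for a generated protobuf builder; illustrative only.
final class GetCertResponseSketch {

  static final class Builder {
    private String x509RootCACertificate;

    // Rejects null the way generated Java protobuf setters do.
    Builder setX509RootCACertificate(String value) {
      if (value == null) {
        throw new NullPointerException();
      }
      x509RootCACertificate = value;
      return this;
    }

    String build() {
      return x509RootCACertificate;
    }
  }

  // Unguarded translation: a null certificate surfaces as a bare
  // NullPointerException from inside the builder.
  static String getCACertificate(String rootCaCertPem) {
    return new Builder().setX509RootCACertificate(rootCaCertPem).build();
  }

  // Guarded translation: reports the missing certificate explicitly
  // before the builder is touched.
  static String getCACertificateChecked(String rootCaCertPem) {
    if (rootCaCertPem == null) {
      throw new IllegalStateException(
          "Root CA certificate is not available on this SCM");
    }
    return getCACertificate(rootCaCertPem);
  }

  public static void main(String[] args) {
    try {
      getCACertificate(null);
    } catch (NullPointerException e) {
      System.out.println("unguarded: NullPointerException without a message");
    }
    try {
      getCACertificateChecked(null);
    } catch (IllegalStateException e) {
      System.out.println("guarded: " + e.getMessage());
    }
  }
}
```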

  was:
During SCM initialization, the following causes an NPE:
StorageContainerManager#initializeCertificateClient initializes the 
scmCertificateClient only if scmStorageConfig#checkPrimarySCMIdInitialized() 
evaluates to true, which happens only if the VERSION file contains 
primaryScmNodeId with a value.

If you upgrade an existing cluster with a single SCM to this code, the VERSION 
file does not contain a primaryScmNodeId, so the scmCertificateClient remains 
null.

Later, the initialization code calls the 
StorageContainerManager#initializeCAnSecurityProtocol method, which at the end 
creates the securityProtocolServer. For the constructor call, the rootCACert is 
provided by calling the scmCertificateClient#getCACertificate method, but this 
is a null dereference because scmCertificateClient is null.

The scmCertificateClient being null can cause problems later as well, as it is 
used multiple times unconditionally.


> NPE during secure SCM initialization with HA code updated to an already 
> existing cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: HDDS-5078
>                 URL: https://issues.apache.org/jira/browse/HDDS-5078
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM HA
>            Reporter: István Fajth
>            Priority: Blocker


