[ https://issues.apache.org/jira/browse/HDDS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoyu Yao resolved HDDS-5078.
------------------------------
    Fix Version/s: 1.2.0
       Resolution: Fixed

Thanks [~bharat] for the contribution. PR has been merged. 

> [SCM HA Security] NPE during secure SCM initialization with HA code updated 
> to an already existing cluster
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-5078
>                 URL: https://issues.apache.org/jira/browse/HDDS-5078
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: SCM HA
>            Reporter: István Fajth
>            Assignee: Bharat Viswanadham
>            Priority: Blocker
>             Fix For: 1.2.0
>
>
> On a Cloudera Manager managed cluster, SCM is always started with the --init 
> option specified, and this behaviour revealed the following null pointer 
> dereference:
> StorageContainerManager#initializeCertificateClient initializes the 
> scmCertificateClient only if scmStorageConfig#checkPrimarySCMIdInitialized() 
> evaluates to true, which happens only if the VERSION file contains a 
> primaryScmNodeId with a value.
> If you upgrade an existing cluster with a single SCM to this code, the 
> VERSION file does not contain a primaryScmNodeId, so the scmCertificateClient 
> remains null.
> Later the initialization code calls the 
> StorageContainerManager#initializeCAnSecurityProtocol method, which at the 
> end creates the securityProtocolServer. The constructor call gets the 
> rootCACert from scmCertificateClient#getCACertificate, which is a null 
> dereference because scmCertificateClient is null.
> A null scmCertificateClient can cause problems later as well, as it 
> is used multiple times unconditionally.
> Later on, after working around this particular problem (by simply letting the 
> code create the scmCertificateClient without conditions), it turned out that 
> in the StorageContainerManager#initializeCAnSecurityProtocol call the 
> scmCertificateServer and the rootCertificateServer instances also remain 
> uninitialized, which causes problems when an SCM client tries to get the 
> root CA certificate from the SCM.
> To me this suggests that initialization of SCM fails after an upgrade on an 
> old cluster. This was working fine before: --init did not reinitialize 
> anything, but worked fine.
> If I change the Cloudera Manager behaviour so that it does not init the SCM 
> when I start it, I still get the same NPE from the SCM as with --init.
> The exception I get in the SCM log is as follows; the command I issue is a 
> recommission of a DN that was decommissioned before the upgrade.
> {code}
> java.lang.NullPointerException
>       at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMGetCertResponseProto$Builder.setX509RootCACertificate(SCMSecurityProtocolProtos.java:9026)
>       at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getCACertificate(SCMSecurityProtocolServerSideTranslatorPB.java:257)
>       at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:104)
>       at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>       at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:89)
>       at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:10537)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:986)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:914)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2887)
> {code}
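
For anyone following the description above, the failure path can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions, not the Ozone source: the classes ScmInitSketch, VersionFile, CertificateClient and Scm are hypothetical stand-ins, and only the method and field names that appear in the report (initializeCertificateClient, checkPrimarySCMIdInitialized, initializeCAnSecurityProtocol, getCACertificate, scmCertificateClient, primaryScmNodeId) are taken from it. It shows how a conditionally initialized field that is later used unconditionally produces the NPE when the VERSION file has no primaryScmNodeId.

```java
/**
 * Simplified, hypothetical model of the initialization order described in
 * the report. None of these classes are the real Ozone classes.
 */
public class ScmInitSketch {

    /** Stand-in for the VERSION file; primaryScmNodeId is null on a pre-HA cluster. */
    static final class VersionFile {
        private final String primaryScmNodeId;

        VersionFile(String primaryScmNodeId) {
            this.primaryScmNodeId = primaryScmNodeId;
        }

        boolean checkPrimarySCMIdInitialized() {
            return primaryScmNodeId != null && !primaryScmNodeId.isEmpty();
        }
    }

    /** Stand-in for the certificate client; returns a placeholder string. */
    static final class CertificateClient {
        String getCACertificate() {
            return "placeholder-root-ca-cert";
        }
    }

    /** Stand-in for the SCM, reduced to the two initialization steps. */
    static final class Scm {
        private final VersionFile scmStorageConfig;
        private CertificateClient scmCertificateClient; // stays null if the check fails

        Scm(VersionFile scmStorageConfig) {
            this.scmStorageConfig = scmStorageConfig;
        }

        /** Step 1: the client is created only when the primary SCM id is present. */
        void initializeCertificateClient() {
            if (scmStorageConfig.checkPrimarySCMIdInitialized()) {
                scmCertificateClient = new CertificateClient();
            }
        }

        /** Step 2: the client is used unconditionally, so a null client throws here. */
        String initializeCAnSecurityProtocol() {
            return scmCertificateClient.getCACertificate();
        }
    }

    /** Runs both steps and reports whether step 2 hit a NullPointerException. */
    static boolean startupThrowsNpe(String primaryScmNodeId) {
        Scm scm = new Scm(new VersionFile(primaryScmNodeId));
        scm.initializeCertificateClient();
        try {
            scm.initializeCAnSecurityProtocol();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Upgraded single-SCM cluster: VERSION has no primaryScmNodeId.
        System.out.println("pre-HA VERSION file: NPE=" + startupThrowsNpe(null));
        // Cluster whose VERSION file already carries a primaryScmNodeId ("scm1" is made up).
        System.out.println("HA VERSION file:     NPE=" + startupThrowsNpe("scm1"));
    }
}
```

In this model the first call reports an NPE and the second does not. The model covers only the startup dereference; the later NPE in the stack trace comes from a different path, where the root CA certificate handed to the protobuf builder is null.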


