[
https://issues.apache.org/jira/browse/HDDS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
István Fajth updated HDDS-5078:
-------------------------------
Description:
On a cluster managed by Cloudera Manager, SCM is always started with the --init
option, and this behaviour revealed the following null pointer dereference:
StorageContainerManager#initializeCertificateClient initializes
scmCertificateClient only if scmStorageConfig#checkPrimarySCMIdInitialized()
evaluates to true, which happens only when the VERSION file contains
primaryScmNodeId with a value.
If you upgrade an existing cluster with a single SCM to this code, the VERSION
file does not contain a primaryScmNodeId, so scmCertificateClient remains
null.
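The gating described above can be illustrated with a minimal sketch. It models the VERSION file as a java.util.Properties object. The class name, the helper methods, the placeholder client object and the "clusterID" entry are assumptions made for illustration, not the Ozone source; only the primaryScmNodeId key comes from the description.

```java
import java.util.Properties;

// Hypothetical stand-in for the gating logic described above; not the Ozone source.
class VersionFileGateSketch {
    // Property key taken from the description.
    static final String PRIMARY_SCM_NODE_ID = "primaryScmNodeId";

    // Assumed shape of the check: true only if the key is present with a value.
    static boolean primaryScmIdInitialized(Properties versionFile) {
        String id = versionFile.getProperty(PRIMARY_SCM_NODE_ID);
        return id != null && !id.isEmpty();
    }

    // Returns a placeholder client object, or null when the gate is closed.
    static Object initializeCertificateClient(Properties versionFile) {
        return primaryScmIdInitialized(versionFile) ? new Object() : null;
    }

    public static void main(String[] args) {
        // VERSION file written before the upgrade: the key is absent.
        Properties upgraded = new Properties();
        upgraded.setProperty("clusterID", "CID-example"); // illustrative entry

        // VERSION file written by the new code: the key is present.
        Properties fresh = new Properties();
        fresh.setProperty(PRIMARY_SCM_NODE_ID, "scm-1");

        System.out.println("upgraded cluster client is null: "
            + (initializeCertificateClient(upgraded) == null));
        System.out.println("fresh cluster client is null: "
            + (initializeCertificateClient(fresh) == null));
    }
}
```

In this sketch the upgraded cluster ends up with a null client purely because of what the VERSION file contains, which matches the failure mode reported here.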
Later, the initialization code calls
StorageContainerManager#initializeCAnSecurityProtocol, which at the end
creates the securityProtocolServer. The rootCACert for that constructor call is
obtained by calling scmCertificateClient#getCACertificate, which is a null
dereference because scmCertificateClient is null.
A null scmCertificateClient can cause problems later as well, as it is
used multiple times unconditionally.
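One possible way to make such an unconditional use fail with a clear message is an explicit null check at the point of use. The following is only a sketch under assumptions: the CertClient interface and the rootCaFor helper are hypothetical names, and this is not a proposed patch for the issue.

```java
import java.util.Objects;

// Hypothetical sketch; the interface and method names are illustrative,
// not the Ozone API.
class FailFastSketch {
    interface CertClient {
        String getCACertificate();
    }

    // Fails with a descriptive message instead of a bare NPE when the client
    // was never initialized.
    static String rootCaFor(CertClient scmCertificateClient) {
        Objects.requireNonNull(scmCertificateClient,
            "scmCertificateClient is null: VERSION file has no primaryScmNodeId?");
        return scmCertificateClient.getCACertificate();
    }

    public static void main(String[] args) {
        // Initialized client: the certificate is returned.
        System.out.println(rootCaFor(() -> "dummy-root-ca"));

        // Uninitialized client: the message names the likely cause.
        try {
            rootCaFor(null);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A check like this does not fix the missing initialization, it only makes the cause visible where the dereference happens.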
After working around this particular problem (by simply letting the code
create the scmCertificateClient unconditionally), it turned out that in the
StorageContainerManager#initializeCAnSecurityProtocol call the
scmCertificateServer and rootCertificateServer instances also remain
uninitialized, which causes problems when an SCM client tries to get the
root CA certificate from the SCM.
To me this suggests that SCM initialization fails after an upgrade on an
old cluster. This worked fine before: --init did not reinitialize
anything, and the SCM started without errors.
If I change Cloudera Manager so that it does not init the SCM on start,
I still get the same NPE from the SCM as with --init.
The exception from the SCM log is below. The command I issued was a
recommission of a DN that had been decommissioned before the upgrade.
{code}
java.lang.NullPointerException
    at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMGetCertResponseProto$Builder.setX509RootCACertificate(SCMSecurityProtocolProtos.java:9026)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getCACertificate(SCMSecurityProtocolServerSideTranslatorPB.java:257)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:104)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
    at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:89)
    at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:10537)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:986)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:914)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2887)
{code}
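The top frame of the trace is a setter on a generated protobuf builder. Generated Java builders throw a NullPointerException when a null value is passed to such a setter, so the trace is consistent with the translator handing a null root CA certificate to the response builder. The sketch below imitates that behaviour without a protobuf dependency and shows one way a handler could report the condition explicitly. ResponseBuilder, the getCACertificate helper and the error message are hypothetical stand-ins, not the Ozone classes.

```java
import java.io.IOException;

// Hypothetical sketch; no protobuf dependency, names are illustrative only.
class RootCaResponseSketch {
    // Imitates a generated protobuf builder: null arguments are rejected.
    static final class ResponseBuilder {
        private String rootCa;

        ResponseBuilder setX509RootCACertificate(String value) {
            if (value == null) {
                throw new NullPointerException();
            }
            this.rootCa = value;
            return this;
        }

        String build() {
            return rootCa;
        }
    }

    // Checks the certificate before it reaches the builder, so the caller
    // gets a descriptive error instead of a bare NPE.
    static String getCACertificate(String rootCaPem) throws IOException {
        if (rootCaPem == null) {
            throw new IOException("Root CA certificate is not available on this SCM");
        }
        return new ResponseBuilder().setX509RootCACertificate(rootCaPem).build();
    }

    public static void main(String[] args) {
        try {
            getCACertificate(null);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```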
was:
During SCM initialization, the following causes an NPE:
StorageContainerManager#initializeCertificateClient initializes
scmCertificateClient only if scmStorageConfig#checkPrimarySCMIdInitialized()
evaluates to true, which happens only when the VERSION file contains
primaryScmNodeId with a value.
If you upgrade an existing cluster with a single SCM to this code, the VERSION
file does not contain a primaryScmNodeId, so scmCertificateClient remains
null.
Later, the initialization code calls
StorageContainerManager#initializeCAnSecurityProtocol, which at the end
creates the securityProtocolServer. The rootCACert for that constructor call is
obtained by calling scmCertificateClient#getCACertificate, which is a null
dereference because scmCertificateClient is null.
A null scmCertificateClient can cause problems later as well, as it is
used multiple times unconditionally.
> NPE during secure SCM initialization with HA code updated to an already
> existing cluster
> ----------------------------------------------------------------------------------------
>
> Key: HDDS-5078
> URL: https://issues.apache.org/jira/browse/HDDS-5078
> Project: Apache Ozone
> Issue Type: Bug
> Components: SCM HA
> Reporter: István Fajth
> Priority: Blocker