[
https://issues.apache.org/jira/browse/HDDS-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
István Fajth updated HDDS-10817:
--------------------------------
Description:
The exact commit that caused the problem is unclear, but we have observed two
things that cause trouble.
One is [this
condition|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L590]
which prevents the initialization of the certificate client, and leads to an
NPE later on in
[initializeCAnSecurityProtocol|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L898].
Even if we resolve that NPE, the certificateClient remains uninitialized, so
other problems would arise later when the system tries to access it. The
initial condition therefore has to be changed, or fulfilled somehow, during
initialization or upgrade.
Once the certificate client is initialized, we start to see another problem,
now with SecretKeyManager, as it might miss its initialization due to [this
check|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L418],
where ratisEnabled evaluates to true if the config for
{{ozone.scm.ratis.enable}} is set and uses the default value.
The problem is that the scmInit method overwrites the VERSION file and sets the
SCM_HA flag to true if the VERSION file does not have the SCM_HA flag set
[here|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1347].
So if one starts the only SCM with --init after the upgrade and then starts it
again without arguments, this issue appears once the VERSION file has been
fixed for the first issue.
To prevent both issues (and preserve the idempotency of the --init startup
option), two changes are needed for this upgrade.
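To make the VERSION file behaviour described above concrete, here is a minimal
sketch. It is not the Ozone code and not the proposed fix: the class, the
method names, the "scmHA" key and the "alreadyInitialized" parameter are all
hypothetical, and the VERSION file is modelled as a plain java.util.Properties
object. It only shows how an init step that fills in a missing flag with true
changes the state of an upgraded non-HA SCM, and what an init step that leaves
an existing store untouched could look like.

```java
import java.util.Properties;

// Illustrative model only; none of these names are taken from Ozone.
final class VersionFileSketch {
  // Hypothetical key; the real property name in the SCM VERSION file may differ.
  static final String SCM_HA_KEY = "scmHA";

  // Behaviour as described in the issue: if the flag is missing,
  // init writes it as true, even for a store that was running non-HA.
  static Properties initAsDescribed(Properties version) {
    Properties out = new Properties();
    out.putAll(version);
    if (!out.containsKey(SCM_HA_KEY)) {
      out.setProperty(SCM_HA_KEY, "true");
    }
    return out;
  }

  // Assumed alternative: only a fresh store gets the flag, so running
  // init again on an existing store does not change it.
  static Properties initPreserving(Properties version, boolean alreadyInitialized) {
    Properties out = new Properties();
    out.putAll(version);
    if (!alreadyInitialized && !out.containsKey(SCM_HA_KEY)) {
      out.setProperty(SCM_HA_KEY, "true");
    }
    return out;
  }

  // A missing flag is read as non-HA in this model.
  static boolean isHa(Properties version) {
    return Boolean.parseBoolean(version.getProperty(SCM_HA_KEY, "false"));
  }

  public static void main(String[] args) {
    Properties upgraded = new Properties();
    upgraded.setProperty("clusterID", "CID-example"); // placeholder content

    System.out.println("before init:      HA=" + isHa(upgraded));
    System.out.println("init as described: HA=" + isHa(initAsDescribed(upgraded)));
    System.out.println("init preserving:   HA=" + isHa(initPreserving(upgraded, true)));
  }
}
```

In this model the first variant turns a non-HA store into one marked as HA on
the first --init after the upgrade, while the second variant returns an
existing store unchanged no matter how often it is applied.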
> Non-HA SCM node can not start after upgrading to 1.4, or current master
> -----------------------------------------------------------------------
>
> Key: HDDS-10817
> URL: https://issues.apache.org/jira/browse/HDDS-10817
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: István Fajth
> Assignee: István Fajth
> Priority: Major