[ 
https://issues.apache.org/jira/browse/HDDS-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

István Fajth updated HDDS-10817:
--------------------------------
    Description: 
The exact commit that caused the problem is unclear, but we have observed two 
things that cause trouble.

One is [this 
condition|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L590]
 which prevents the initialization of the certificate client, and leads to an 
NPE later on in 
[initializeCAnSecurityProtocol|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L898].

Even if we resolve that NPE, the certificateClient remains uninitialized, so 
other problems would arise later when the system tries to access it. The 
initial condition therefore has to be changed, or fulfilled somehow, during 
initialization or upgrade.

Once the certificate client is initialized, we start to see another problem, 
now with SecretKeyManager, as it might miss its initialization due to [this 
check|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L418],
 where ratisEnabled evaluates to true when 
{{ozone.scm.ratis.enable}} takes its default value.
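
The flag evaluation described above can be sketched roughly as follows. This is only an illustration, not the actual StorageContainerManager code: the class name and the map-based lookup are hypothetical, and the default of true is taken solely from the behaviour described in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the config lookup; not the real Ozone API.
final class RatisFlagSketch {
    static final String KEY = "ozone.scm.ratis.enable";
    // Assumption taken from the description: the default resolves to true.
    static final boolean ASSUMED_DEFAULT = true;

    // Returns the explicit value if the key is present, else the default.
    static boolean ratisEnabled(Map<String, String> conf) {
        String raw = conf.get(KEY);
        return raw == null ? ASSUMED_DEFAULT : Boolean.parseBoolean(raw);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(ratisEnabled(conf)); // key absent: default applies
        conf.put(KEY, "false");
        System.out.println(ratisEnabled(conf)); // explicitly disabled
    }
}
```

Under these assumptions, a non-HA SCM whose configuration never mentioned the key is treated as Ratis-enabled, which is the case where the SecretKeyManager initialization can be missed.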

The problem is that the scmInit method overwrites the VERSION file and sets the 
SCM_HA flag to true if the VERSION file does not have the SCM_HA flag set 
[here|https://github.com/apache/ozone/blob/master/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1347].
 So if one starts the only SCM by running start with --init after the upgrade, 
and then starts it without arguments, this issue appears after fixing the 
VERSION file for the first issue.
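
A minimal model of the --init behaviour described above, assuming the VERSION file can be treated as a plain properties file. The class, the method, and the {{scmHa}} key name are hypothetical and only illustrate the effect; the real scmInit logic is in the linked source.

```java
import java.util.Properties;

// Hypothetical model of the VERSION file handling; not the real scmInit code.
final class VersionFileSketch {
    // Assumed key name for the SCM_HA flag; the real key may differ.
    static final String SCM_HA_KEY = "scmHa";

    // Models the described behaviour: set the flag to true when it is absent.
    static void initAsDescribed(Properties version) {
        if (version.getProperty(SCM_HA_KEY) == null) {
            version.setProperty(SCM_HA_KEY, "true");
        }
    }

    public static void main(String[] args) {
        // VERSION file of a non-HA SCM written before the upgrade: no flag.
        Properties version = new Properties();
        initAsDescribed(version);
        System.out.println(version.getProperty(SCM_HA_KEY)); // now "true"
        // A second --init run leaves the file unchanged.
        initAsDescribed(version);
        System.out.println(version.getProperty(SCM_HA_KEY));
    }
}
```

In this model, running --init once is enough to mark a single, previously non-HA SCM as HA in its VERSION file, and a repeated run does not change the result.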

So, to prevent both issues (and preserve the idempotency of the --init 
startup option), two changes need to happen on this upgrade.


> Non-HA SCM node can not start after upgrading to 1.4, or current master
> -----------------------------------------------------------------------
>
>                 Key: HDDS-10817
>                 URL: https://issues.apache.org/jira/browse/HDDS-10817
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Major


