[
https://issues.apache.org/jira/browse/HDDS-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703859#comment-17703859
]
Neil Joshi commented on HDDS-7985:
----------------------------------
This problem appears to be caused by duplicate SCM certs stored in the db for
the recovered SCM. When SCM recovery from a failed disk includes SCM
decommissioning, the issue can be resolved by properly shutting down the SCM
before recovery and restart.
Once an SCM is offline for maintenance, a mechanism is needed to remove that
SCM's certificate from the database. For this ticket I am proposing to add an
ozone admin cert CLI command that removes the certificates of SCMs that are
under maintenance.
An administrative command similar to:
{code:java}
ozone admin cert remove <scmid>
{code}
[~nanda619] , thoughts?
The problem of datanodes detecting duplicate certs on start may also exist
during certificate rotation: renewed certs are added before the old
certificates expire, so for a period of time multiple certs exist for the same
SCM. Should a datanode be brought into service during that period,
HAUtils.waitForCACerts will raise an exception because the number of certs
found is greater than the number expected. Is this a potential problem with
certificate rotation and the cert check on datanode start, [~pifta]?
https://github.com/apache/ozone/blob/3f5a80783a100716f628e12d7365f609d807173e/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/HAUtils.java#L459
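To illustrate why one extra certificate is enough to block a datanode, here is a minimal sketch of a strict size comparison. It is not the actual HAUtils code: the class name, method names, and owner labels (rootCA, scm1, scm2, scm3) are hypothetical, and each certificate is reduced to the name of its owner.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the HAUtils implementation.
class CaCertCountSketch {

  // Strict comparison: any duplicate or overlapping cert makes this false.
  static boolean sizeMatches(List<String> receivedOwners, int expected) {
    return receivedOwners.size() == expected;
  }

  // Counts certificates per owner so that duplicates become visible.
  static Map<String, Integer> countPerOwner(List<String> receivedOwners) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String owner : receivedOwners) {
      counts.merge(owner, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Assumed layout: one root CA cert plus one cert per SCM, with a
    // second cert for scm2 (stale entry after recovery, or old and new
    // certs coexisting during rotation).
    List<String> received =
        List.of("rootCA", "scm1", "scm2", "scm2", "scm3");
    int expected = 4;

    System.out.println("size matches: " + sizeMatches(received, expected));
    System.out.println("per owner: " + countPerOwner(received));
  }
}
```

Under these assumptions the size check fails for both the stale-entry case and the rotation-overlap case, since either one leaves two certs for the same SCM in the list.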
> [SCM HA] On SCM Disk failure recovery causes Datanode Failure on startup
> -------------------------------------------------------------------------
>
> Key: HDDS-7985
> URL: https://issues.apache.org/jira/browse/HDDS-7985
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Neil Joshi
> Priority: Major
>
> Recovery from an SCM disk failure when no backup is available requires:
> * Cleaning the _ozone.scm.db.dirs_ and _ozone.metadata.dirs_ locations
> and bootstrapping the SCM.
>
> Whether or not the SCM is primordial, an error occurs when starting a
> datanode after the SCM has been recovered from a failed disk with no backup.
>
> Datanodes brought up after SCM disk failure recovery fail to start with a CA
> certificate error stating that the number of certificates received from the
> SCM is greater than the number expected:
> {code:java}
> ozonesecure-ha-datanode1-1 | 2023-02-17 00:46:40 INFO HAUtils:457 -
> Expected CA list size 4, where as received CA List size 5.{code}
> In this case when listing the certificates stored by the SCM, it reports a
> total of 5 scm certificates after SCM2 recovers from disk failure:
>
> {code:java}
> [email protected]
> [email protected]
> [email protected]
> [email protected]
> [email protected]
>
> {code}
> There appear to be 2 entries for SCM 2 (the node recovered from disk failure):
>
> bash-4.2$ ozone admin cert list
> {code:java}
> Total 12 valid certificates:
> SerialNumber Valid From Expiry
> Subject
>
> 1 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, [email protected]
> 10760186198072 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, [email protected]
> 10779888473070 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=recon@recon
> 10780166036417 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f99f1a81-7cce-44c9-a09b-9f7bbc48b6ac, [email protected]
> 10788394717480 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=598be6bc-7d86-4cab-84dc-668a162a7ec2, [email protected]
> 10800769855768 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@bd3138308a3f
> 10801305457014 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@e4795cc77124
> 10801871334038 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@3eb28ff965a1
> 10803980992569 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om2
> 10804543987939 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om3
> 10806118720884 Fri Feb 17 00:00:00 UTC 2023 Sat Feb 17 00:00:00 UTC 2024
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om1
> 10932809284268 Fri Feb 17 00:00:00 UTC 2023 Mon Mar 27 00:00:00 UTC 2028
> O=CID-abb46225-77ba-4132-ac6e-96792b40450c,
> OU=b4a175f3-c6a4-47fd-bcc5-c081b03de8c7, [email protected] {code}