[jira] [Commented] (HDDS-7985) [SCM HA] On SCM Disk failure recovery causes Datanode Failure on startup

Neil Joshi (Jira) Wed, 22 Mar 2023 21:14:04 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-7985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703859#comment-17703859
 ]


Neil Joshi commented on HDDS-7985:
----------------------------------

This problem appears to be caused by duplicate SCM certificates stored in the 
db for the recovered SCM.  When the SCM recovery from a failed disk includes 
decommissioning the SCM, the issue can be resolved by properly shutting down 
the SCM prior to recovery and restart.  
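
As a rough illustration of how the duplicate entries could be spotted in a 
certificate listing, the sketch below groups subject strings by CN and reports 
any CN that appears more than once.  The class name, method names and the 
example.org host names are hypothetical (the real CNs are not reproduced 
here), and it assumes that both certificates of the recovered SCM carry the 
same CN.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Ozone.
final class DuplicateCertFinder {

  // Extracts the CN value from a subject such as "O=..., OU=..., CN=om2".
  // Returns null when the subject has no CN component.
  static String commonName(String subject) {
    for (String part : subject.split(",")) {
      String trimmed = part.trim();
      if (trimmed.startsWith("CN=")) {
        return trimmed.substring(3);
      }
    }
    return null;
  }

  // Returns the CNs that occur more than once, in first-seen order.
  static List<String> duplicatedCommonNames(List<String> subjects) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String subject : subjects) {
      String cn = commonName(subject);
      if (cn != null) {
        counts.merge(cn, 1, Integer::sum);
      }
    }
    List<String> duplicates = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > 1) {
        duplicates.add(entry.getKey());
      }
    }
    return duplicates;
  }

  public static void main(String[] args) {
    // Assumed subjects: scm2 appears twice, once per bootstrap.
    List<String> subjects = List.of(
        "O=CID-1, OU=id-a, CN=scm-sub@scm1.example.org",
        "O=CID-1, OU=id-b, CN=scm-sub@scm2.example.org",
        "O=CID-1, OU=id-c, CN=scm-sub@scm3.example.org",
        "O=CID-1, OU=id-d, CN=scm-sub@scm2.example.org");
    System.out.println(duplicatedCommonNames(subjects));
  }
}
```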

 

Once an SCM is offline for maintenance, a mechanism is needed to remove that 
SCM's certificate from the database.  For this ticket I propose adding an 
ozone admin cert CLI command that removes the certificates of SCMs that are 
under maintenance.

An administrative command similar to:
{code:java}
ozone admin cert remove <scmid>
{code}
[~nanda619] , thoughts?

 

The same problem of datanodes detecting duplicate certs on startup may also 
exist during certificate rotation, since the renewed certs are added before 
the old certificates expire, resulting in a period of time when multiple certs 
exist for the same SCM.  Should a datanode be brought into service during that 
period, HAUtils.waitForCACerts will raise an exception because the number of 
certs found is greater than the number expected.  Is this a potential problem 
with certificate rotation and the cert check on datanode start, [~pifta]?

https://github.com/apache/ozone/blob/3f5a80783a100716f628e12d7365f609d807173e/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/HAUtils.java#L459
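
To make the failure mode concrete, here is a minimal sketch of the kind of 
size comparison described above.  It is not the HAUtils code: the class and 
method names are hypothetical, the list entries are placeholder strings 
standing in for certificates, and any waiting or retrying done by the real 
method is left out.

```java
import java.util.List;

// Hypothetical illustration only; not the Ozone HAUtils implementation.
final class CaCertCountCheck {

  // Rejects a CA list that holds more entries than the caller expects.
  static void requireAtMost(int expected, List<String> receivedCaCerts) {
    if (receivedCaCerts.size() > expected) {
      throw new IllegalStateException("Expected CA list size " + expected
          + ", received CA list size " + receivedCaCerts.size());
    }
  }

  public static void main(String[] args) {
    // Placeholder names: one root CA, three SCMs, plus an extra scm2 entry.
    List<String> received =
        List.of("rootCA", "scm1", "scm2-old", "scm2-new", "scm3");
    try {
      requireAtMost(4, received);
      System.out.println("CA list accepted");
    } catch (IllegalStateException e) {
      System.out.println("Datanode start would fail: " + e.getMessage());
    }
  }
}
```

With four certs expected and five received, the check throws, which matches 
the sizes reported in the datanode log quoted below.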

 

> [SCM HA] On SCM Disk failure recovery causes Datanode Failure on startup 
> -------------------------------------------------------------------------
>
>                 Key: HDDS-7985
>                 URL: https://issues.apache.org/jira/browse/HDDS-7985
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Neil Joshi
>            Priority: Major
>
> Recovery from an SCM disk failure when no backup is available requires:
>  * Cleaning the _ozone.scm.db.dirs_ and _ozone.metadata.dirs_ locations
>  * Bootstrapping the SCM
> Whether or not the SCM is primordial, an error occurs when starting a 
> datanode after the SCM has been recovered from a failed disk with no backup. 
>  
> Datanodes brought up after SCM disk failure recovery are unable to start 
> because of a CA certificate error stating that the number of certificates 
> received from the SCM is greater than the number expected:
> {code:java}
> ozonesecure-ha-datanode1-1  | 2023-02-17 00:46:40 INFO  HAUtils:457 - Expected CA list size 4, where as received CA List size 5.
> {code}
> In this case when listing the certificates stored by the SCM, it reports a 
> total of 5 scm certificates after SCM2 recovers from disk failure:
>  
> {code:java}
> [email protected]
> [email protected]
> [email protected]
> [email protected]
> [email protected]
>  
> {code}
> There appear to be 2 entries for SCM2 (the node recovered from the disk failure).
>  
> bash-4.2$ ozone admin cert list
> {code:java}
> Total 12 valid certificates: 
> SerialNumber      Valid From                     Expiry                         Subject
> 1                 Fri Feb 17 00:00:00 UTC 2023   Mon Mar 27 00:00:00 UTC 2028   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, [email protected]
> 10760186198072    Fri Feb 17 00:00:00 UTC 2023   Mon Mar 27 00:00:00 UTC 2028   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, [email protected]
> 10779888473070    Fri Feb 17 00:00:00 UTC 2023   Sat Feb 17 00:00:00 UTC 2024   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=recon@recon
> 10780166036417    Fri Feb 17 00:00:00 UTC 2023   Mon Mar 27 00:00:00 UTC 2028   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f99f1a81-7cce-44c9-a09b-9f7bbc48b6ac, [email protected]
> 10788394717480    Fri Feb 17 00:00:00 UTC 2023   Mon Mar 27 00:00:00 UTC 2028   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=598be6bc-7d86-4cab-84dc-668a162a7ec2, [email protected]
> 10800769855768    Fri Feb 17 00:00:00 UTC 2023   Sat Feb 17 00:00:00 UTC 2024   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@bd3138308a3f
> 10801305457014    Fri Feb 17 00:00:00 UTC 2023   Sat Feb 17 00:00:00 UTC 2024   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@e4795cc77124
> 10801871334038    Fri Feb 17 00:00:00 UTC 2023   Sat Feb 17 00:00:00 UTC 2024   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=dn@3eb28ff965a1
> 10803980992569    Fri Feb 17 00:00:00 UTC 2023   Sat Feb 17 00:00:00 UTC 2024   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om2
> 10804543987939    Fri Feb 17 00:00:00 UTC 2023   Sat Feb 17 00:00:00 UTC 2024   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om3
> 10806118720884    Fri Feb 17 00:00:00 UTC 2023   Sat Feb 17 00:00:00 UTC 2024   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=f02b032a-7da0-4132-8a31-61c3d078e6cb, CN=om1
> 10932809284268    Fri Feb 17 00:00:00 UTC 2023   Mon Mar 27 00:00:00 UTC 2028   O=CID-abb46225-77ba-4132-ac6e-96792b40450c, OU=b4a175f3-c6a4-47fd-bcc5-c081b03de8c7, [email protected]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
