[ 
https://issues.apache.org/jira/browse/HDDS-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reassigned HDDS-15008:
--------------------------------------

    Assignee: Wei-Chiu Chuang

> [Docs] Explain the automated certificate rotation mechanism in System 
> Internals
> -------------------------------------------------------------------------------
>
>                 Key: HDDS-15008
>                 URL: https://issues.apache.org/jira/browse/HDDS-15008
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: documentation
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>              Labels: pull-request-available
>
> Add a page "Certificate Rotation" under System Internals -> Security:
>  
> ✦ Certificate rotation in Apache Ozone is a multi-layered, automated process
> designed to ensure continuous security without service downtime. It follows an
> Orchestration -> Discovery -> Execution lifecycle:
>   1. SCM Root CA Rotation (The Orchestrator)
>   The Storage Container Manager (SCM) orchestrates the entire process via the
>   RootCARotationManager.
>    * Monitoring: It monitors the current Root CA certificate's lifetime. When
>      the certificate enters a configured "grace period" (usually 2 * renewal
>      grace period before expiry), rotation is triggered.
>    * New Root CA: SCM generates a new Root CA key pair and certificate.
>    * Sub-CA Rotation (SCM HA): In HA mode, SCM uses Ratis to coordinate Sub-CA
>      rotation across all nodes. Each SCM node generates a new Sub-CA key pair
>      and gets its certificate signed by the new Root CA.
>    * Post-Processing: After a successful rotation, SCM enters a
>      "post-processing" state where it avoids signing new certificates for a
>      short period to allow the new Root CA to propagate across the cluster.
>   2. Discovery (The Poller)
>   Every component (Datanode, Ozone Manager, S3Gateway, Recon) runs a
>   RootCaRotationPoller as part of its CertificateClient.
>    * Polling: The poller periodically calls SCM's getAllRootCaCertificates
>      API.
>    * Detection: If a new Root CA is found that isn't in the component's
>      "known" set, it triggers a forced renewal of the component's own
>      certificate.
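
The detection step above amounts to a set difference between the Root CAs that SCM reports and the ones the component already knows. A minimal sketch in Python follows; it is a toy model rather than Ozone's Java implementation. The class name `RootCaPollerSketch`, the string certificate IDs, and both callbacks are hypothetical, and updating the known set right after the callback is an assumption made for illustration.

```python
from typing import Callable, Iterable, Set


class RootCaPollerSketch:
    """Toy model of a poller that reacts to previously unseen Root CAs."""

    def __init__(self,
                 known_root_ids: Iterable[str],
                 fetch_root_ids: Callable[[], Iterable[str]],
                 on_new_root: Callable[[Set[str]], None]) -> None:
        self.known = set(known_root_ids)
        # Stand-in for the call to SCM's getAllRootCaCertificates API.
        self.fetch_root_ids = fetch_root_ids
        # Stand-in for "force a renewal of the component's own certificate".
        self.on_new_root = on_new_root

    def poll_once(self) -> Set[str]:
        """Run one polling cycle and return the newly discovered Root CA IDs."""
        unseen = set(self.fetch_root_ids()) - self.known
        if unseen:
            self.on_new_root(unseen)
            # Assumption: remember the new roots only after the callback ran,
            # so a failing callback is retried on the next cycle.
            self.known |= unseen
        return unseen
```

A second call to `poll_once()` with the same SCM response returns an empty set, so the renewal callback fires only once per new Root CA.
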
>   3. Component Certificate Rotation (The Execution)
>   When a component (e.g., a Datanode) detects a new Root CA, it triggers the
>   CertificateRenewerService:
>    * Renewal: The component generates a new key pair and sends a Certificate
>      Signing Request (CSR) to SCM. SCM signs the request using its new Sub-CA
>      (which is chained to the new Root CA).
>    * Atomic Swap: The component performs an atomic disk operation:
>        1. Moves the current keys and certificates to a backup directory.
>        2. Moves the newly generated keys and certificates from a "next"
>           directory to the active directory.
>    * Persistence: A callback (e.g., datanodeDetails.setCertSerialId) is
>      executed to update the component's VERSION file, ensuring the new
>      certificate ID is persisted across restarts.
>    * Reload: The component reloads the new keys and certificates into its
>      in-memory KeyManager and TrustManager, seamlessly updating TLS
>      connections.
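
The two-step disk swap described above can be sketched with plain directory renames. This is an illustrative Python model, not the code Ozone runs: the directory names `current` and `backup` and the function `swap_certificates` are hypothetical (only the "next" directory is named in the description), and the rollback on failure is an assumption. Each rename is atomic on a POSIX filesystem, but the pair of renames is not, which is why the sketch restores the backup if the second step fails.

```python
import os
import tempfile
from pathlib import Path


def swap_certificates(base: Path) -> None:
    """Promote base/next to base/current, keeping old material in base/backup."""
    current, staged, backup = base / "current", base / "next", base / "backup"
    if not staged.is_dir():
        raise FileNotFoundError(f"no staged material in {staged}")
    if backup.exists():
        raise FileExistsError(f"stale backup at {backup}; resolve it first")
    os.rename(current, backup)       # step 1: active -> backup
    try:
        os.rename(staged, current)   # step 2: next -> active
    except OSError:
        os.rename(backup, current)   # roll back so a usable identity remains
        raise


def demo() -> str:
    """Exercise the swap in a throwaway directory; returns 'active,backup'."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for name, content in (("current", "old"), ("next", "new")):
            (base / name).mkdir()
            (base / name / "cert.pem").write_text(content)
        swap_certificates(base)
        active = (base / "current" / "cert.pem").read_text()
        saved = (base / "backup" / "cert.pem").read_text()
        return f"{active},{saved}"
```

Calling `demo()` leaves the new certificate in the active directory and the old one in the backup directory.
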
>   Key Components
>   ┌───────────────────────────┬───────────────────────────────────────────────────────────────────┐
>   │ Class                     │ Role                                                              │
>   ├───────────────────────────┼───────────────────────────────────────────────────────────────────┤
>   │ RootCARotationManager     │ Orchestrates Root CA and Sub-CA rotation in SCM.                  │
>   │ RootCaRotationPoller      │ Polls SCM for new Root CAs on components (DN, OM, etc.).          │
>   │ DefaultCertificateClient  │ Base class for certificate management on all components.          │
>   │ CertificateRenewerService │ Handles the key generation, signing, and disk swap logic.         │
>   │ ClientTrustManager        │ Manages trust anchors for clients during the rotation transition. │
>   └───────────────────────────┴───────────────────────────────────────────────────────────────────┘
>  
> --
> ✦ The Renewal Grace Period is a critical security parameter in Ozone that
> defines the time window before expiry during which a certificate should be
> renewed.
>   Key Details:
>    * Configuration Key: hdds.x509.renew.grace.duration
>    * Default Value: P28D (28 days)
>    * Format: ISO-8601 duration format (e.g., P28D for 28 days, PT1H for 1
>      hour).
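
To make the grace-period arithmetic concrete, here is a small Python sketch that parses the simple ISO-8601 durations shown above and checks whether a certificate has entered its renewal window. The helper names `parse_duration` and `needs_renewal` are hypothetical, the parser handles only whole days, hours, minutes, and seconds, and the comparison rule (renew once `now >= expiry - grace`) is an assumption about how the window is evaluated, not a statement of Ozone's exact logic.

```python
import re
from datetime import datetime, timedelta, timezone

# Supports only forms such as P28D, PT1H, or P1DT12H30M (no weeks, months,
# years, or fractional values).
_DURATION = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_duration(text: str) -> timedelta:
    """Convert a simple ISO-8601 duration string into a timedelta."""
    match = _DURATION.fullmatch(text)
    parts = {}
    if match:
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts or text.endswith("T"):
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)


def needs_renewal(not_after: datetime, now: datetime, grace: timedelta) -> bool:
    """True once 'now' falls inside the grace window before expiry."""
    return now >= not_after - grace


grace = parse_duration("P28D")
expiry = datetime(2030, 1, 29, tzinfo=timezone.utc)
# With the default P28D, the window for this certificate opens on 2030-01-01.
inside = needs_renewal(expiry, datetime(2030, 1, 1, tzinfo=timezone.utc), grace)
outside = needs_renewal(expiry, datetime(2029, 12, 31, tzinfo=timezone.utc), grace)
```

For the Root CA, the description says rotation usually starts at twice the renewal grace period before expiry, which in this model would be `needs_renewal(expiry, now, 2 * grace)`.
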
>  
> ---
>  
> The Root CA is initially created and hosted by the primordial SCM node.
>  
>    * Primordial SCM: The "First" SCM. It creates the cluster's first Root CA
>      during the initial setup.
>    * Leader SCM: The currently active Ratis leader. It manages the rotation
>      of that Root CA when it approaches expiry.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
