[
https://issues.apache.org/jira/browse/HDDS-15008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang reassigned HDDS-15008:
--------------------------------------
Assignee: Wei-Chiu Chuang
> [Docs] Explain the automated certificate rotation mechanism in System
> Internals
> -------------------------------------------------------------------------------
>
> Key: HDDS-15008
> URL: https://issues.apache.org/jira/browse/HDDS-15008
> Project: Apache Ozone
> Issue Type: Task
> Components: documentation
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Major
> Labels: pull-request-available
>
> Add a page "Certificate Rotation" under System Internals -> Security:
>
> Certificate rotation in Apache Ozone is a multi-layered, automated process
> designed to keep certificates valid without service downtime. It follows an
> Orchestration -> Discovery -> Execution lifecycle:
> 1. SCM Root CA Rotation (The Orchestrator)
> The Storage Container Manager (SCM) orchestrates the entire process via the
> RootCARotationManager.
> * Monitoring: It monitors the remaining lifetime of the current Root CA
> certificate. Rotation is triggered once the certificate enters its configured
> grace period before expiry (usually 2 * the renewal grace period).
> * New Root CA: SCM generates a new Root CA key pair and certificate.
> * Sub-CA Rotation (SCM HA): In HA mode, SCM uses Ratis to coordinate
> Sub-CA rotation across all nodes. Each SCM node generates a new Sub-CA key
> pair and gets its certificate signed by the new Root CA.
> * Post-Processing: After a successful rotation, SCM enters a
> "post-processing" state in which it avoids signing new certificates for a
> short period, allowing the new Root CA to propagate across the cluster.
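
The monitoring rule above can be sketched as a simple time comparison. This is
a minimal illustration, not the RootCARotationManager logic: the function name
root_ca_rotation_due and the multiplier parameter are hypothetical, and the
factor of 2 is taken from the "usually 2 * renewal grace period" wording.

```python
from datetime import datetime, timedelta, timezone


def root_ca_rotation_due(not_after, now, renew_grace, multiplier=2):
    """Return True once `now` falls inside the rotation window before expiry.

    Hypothetical helper: the window is assumed to be `multiplier` times the
    renewal grace period, counted back from the certificate's expiry time.
    """
    rotation_window = multiplier * renew_grace
    return now >= not_after - rotation_window


# Example: a Root CA expiring on 2030-01-01 with a 28-day grace period
# has a 56-day rotation window that opens on 2029-11-06.
not_after = datetime(2030, 1, 1, tzinfo=timezone.utc)
grace = timedelta(days=28)

before_window = datetime(2029, 11, 1, tzinfo=timezone.utc)
inside_window = datetime(2029, 11, 10, tzinfo=timezone.utc)

print(root_ca_rotation_due(not_after, before_window, grace))  # False
print(root_ca_rotation_due(not_after, inside_window, grace))  # True
```

Passing the current time in as an argument, rather than reading the clock
inside the function, keeps the check easy to test.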
> 2. Discovery (The Poller)
> Every component (Datanode, Ozone Manager, S3Gateway, Recon) runs a
> RootCaRotationPoller as part of its CertificateClient.
> * Polling: The poller periodically calls SCM's getAllRootCaCertificates API.
> * Detection: If the response contains a Root CA that is not in the
> component's "known" set, the poller triggers a forced renewal of the
> component's own certificate.
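
The detection step above amounts to a set difference between the Root CAs
returned by SCM and the ones the component already knows. The sketch below
illustrates that idea only; RootCaPollerSketch, fetch_root_cas and
on_new_root_ca are hypothetical names, certificates are represented by plain
string IDs, and updating the known set after the callback is an assumption.

```python
class RootCaPollerSketch:
    """Illustrative poller: compares fetched Root CA IDs with a known set."""

    def __init__(self, fetch_root_cas, on_new_root_ca, known=()):
        self._fetch = fetch_root_cas      # stands in for the SCM API call
        self._on_new = on_new_root_ca     # stands in for the forced renewal
        self._known = set(known)

    def poll_once(self):
        """Run one polling round and return the newly seen Root CA IDs."""
        fetched = set(self._fetch())
        new = fetched - self._known
        if new:
            self._on_new(sorted(new))
            # Assumption: remember the new IDs so the same Root CA does not
            # trigger another renewal on the next round.
            self._known |= new
        return new


# Simulated SCM responses: the second poll reveals a rotated Root CA.
renewals = []
responses = iter([["root-1"], ["root-1", "root-2"], ["root-1", "root-2"]])
poller = RootCaPollerSketch(
    fetch_root_cas=lambda: next(responses),
    on_new_root_ca=renewals.append,
    known={"root-1"},
)
for _ in range(3):
    poller.poll_once()

print(renewals)  # [['root-2']]
```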
> 3. Component Certificate Rotation (The Execution)
> When a component (e.g., a Datanode) detects a new Root CA, it triggers the
> CertificateRenewerService:
> * Renewal: It generates a new key pair and sends a Certificate Signing
> Request (CSR) to SCM. SCM signs this using its new Sub-CA (which is chained
> to the new Root CA).
> * Atomic Swap: The component performs an atomic disk operation:
> 1. Moves the current keys and certificates to a backup directory.
> 2. Moves the newly generated keys and certificates from a "next"
> directory to the active directory.
> * Persistence: A callback (e.g., datanodeDetails.setCertSerialId) is
> executed to update the component's VERSION file, ensuring the new certificate
> ID is persisted across restarts.
> * Reload: The component reloads the new keys and certificates into its
> in-memory KeyManager and TrustManager, seamlessly updating TLS connections.
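
The two-step disk swap described above can be sketched with directory
renames. This is an illustration under stated assumptions, not the
CertificateRenewerService code: the function name and the directory names
(certs, certs-next, certs-backup) are hypothetical, all three paths are
assumed to live on the same filesystem, and the rollback on failure is an
added safety step that the description does not mention.

```python
import os
import tempfile
from pathlib import Path


def swap_certificate_dirs(active: Path, next_dir: Path, backup: Path) -> None:
    """Move `active` to `backup`, then promote `next_dir` to `active`."""
    if backup.exists():
        raise FileExistsError(f"backup directory already exists: {backup}")
    # Step 1: move the current keys and certificates out of the way.
    os.rename(active, backup)
    try:
        # Step 2: promote the newly generated material.
        os.rename(next_dir, active)
    except OSError:
        # Assumed rollback: restore the previous material if step 2 fails.
        os.rename(backup, active)
        raise


def demo():
    """Build a throwaway layout, run the swap, and report the result."""
    with tempfile.TemporaryDirectory() as root:
        base = Path(root)
        active = base / "certs"
        next_dir = base / "certs-next"
        backup = base / "certs-backup"
        active.mkdir()
        next_dir.mkdir()
        (active / "cert.pem").write_text("old")
        (next_dir / "cert.pem").write_text("new")

        swap_certificate_dirs(active, next_dir, backup)

        return (
            (active / "cert.pem").read_text(),
            (backup / "cert.pem").read_text(),
            next_dir.exists(),
        )


print(demo())  # ('new', 'old', False)
```

Each rename is a single filesystem operation, but the pair of renames is not
atomic as a whole, which is why the sketch restores the backup if the second
step fails.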
> Key Components
>
> ┌───────────────────────────┬───────────────────────────────────────────────────────────────────┐
> │ Class                     │ Role                                                              │
> ├───────────────────────────┼───────────────────────────────────────────────────────────────────┤
> │ RootCARotationManager     │ Orchestrates Root CA and Sub-CA rotation in SCM.                  │
> │ RootCaRotationPoller      │ Polls SCM for new Root CAs on components (DN, OM, etc.).          │
> │ DefaultCertificateClient  │ Base class for certificate management on all components.          │
> │ CertificateRenewerService │ Handles the key generation, signing, and disk swap logic.         │
> │ ClientTrustManager        │ Manages trust anchors for clients during the rotation transition. │
> └───────────────────────────┴───────────────────────────────────────────────────────────────────┘
>
> --
> The renewal grace period is a security parameter in Ozone that defines the
> time window before expiry during which a certificate should be renewed.
> Key Details:
> * Configuration Key: hdds.x509.renew.grace.duration
> * Default Value: P28D (28 days)
> * Format: ISO-8601 duration format (e.g., P28D for 28 days, PT1H for 1
> hour).
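
To make the duration format concrete, the sketch below parses the two example
values and computes when a renewal window would open. It is a minimal parser
written for illustration: parse_duration and renewal_window_start are
hypothetical helpers, only the day/hour/minute/second subset of ISO-8601 is
handled, and the expiry date is made up.

```python
import re
from datetime import datetime, timedelta, timezone

# Subset of ISO-8601 durations: PnD, PTnHnMnS, and combinations of the two.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Convert a duration such as 'P28D' or 'PT1H' into a timedelta."""
    match = _DURATION.match(text)
    parts = {}
    if match and not text.endswith("T"):
        parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    if not parts:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**parts)


def renewal_window_start(not_after, grace):
    """Return the moment the renewal window opens for a certificate."""
    return not_after - grace


print(parse_duration("P28D"))  # 28 days, 0:00:00
print(parse_duration("PT1H"))  # 1:00:00

# Hypothetical certificate expiring on 2030-03-01 (UTC).
expiry = datetime(2030, 3, 1, tzinfo=timezone.utc)
print(renewal_window_start(expiry, parse_duration("P28D")).date())  # 2030-02-01
```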
>
> ---
>
> The Root CA is initially created and hosted by the primordial SCM node.
>
> * Primordial SCM: The "First" SCM. It creates the cluster's first Root CA
> during the initial setup.
> * Leader SCM: The currently active Ratis leader. It manages the rotation
> of that Root CA when it approaches expiry.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)