[ 
https://issues.apache.org/jira/browse/HDDS-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gargi Jaiswal reassigned HDDS-14503:
------------------------------------

    Assignee: Gargi Jaiswal

> [Website v2] [Docs] [Administrator Guide] Replacing Storage Container Manager 
> Disks
> -----------------------------------------------------------------------------------
>
>                 Key: HDDS-14503
>                 URL: https://issues.apache.org/jira/browse/HDDS-14503
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: documentation
>            Reporter: Wei-Chiu Chuang
>            Assignee: Gargi Jaiswal
>            Priority: Major
>
> [https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/storage-container-manager]
>  
> If the disk containing the SCM metadata directory (ozone.scm.db.dirs) needs to 
> be replaced for any reason, the SCM metadata directory will need to be 
> reconstructed by running ozone scm --bootstrap (assuming SCM HA is configured).
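>
> A minimal sketch of that recovery on the repaired SCM node (assuming the 
> replacement disk is already mounted at the path configured in 
> ozone.scm.db.dirs, and that the daemon-style start command matches your 
> deployment):
>
>       # Re-create the SCM metadata by bootstrapping from the existing HA quorum
>       $ ozone scm --bootstrap
>       # Start the daemon so this SCM rejoins the quorum as a follower
>       $ ozone --daemon start scm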
>  
> ---
> The Gemini CLI suggested the following content write-up:
>  
>   ---
>   Title: Replacing Storage Container Manager (SCM) Disks
>   Audience: Cluster Administrators
>   Prerequisites: Familiarity with Ozone cluster administration, especially 
> SCM and its HA configuration.
>   ---
>   1. Overview
>    * Purpose: This guide details the procedure for replacing a failed disk on 
> an SCM node.
>    * Impact of SCM Disk Failure: The SCM disk is critical, as it stores the 
> RocksDB database containing the state of the entire cluster's physical 
> storage,
>      including:
>        * DataNode registration and heartbeat status.
>        * Pipeline information and states.
>        * Container locations and replica information.
>      A failure of this disk without a proper recovery plan can render the 
>      cluster unable to manage storage or allocate new blocks.
>    * Crucial Distinction: HA vs. Non-HA: The procedure depends entirely on 
> whether your SCM is a single, standalone instance or part of a 
> High-Availability
>      (HA) Ratis-based quorum. Running a standalone SCM is a single point of 
> failure and is not recommended for production environments.
>   ---
>   2. Pre-flight Checks
>    1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to 
> confirm which disk has failed and its mount point.
>    2. Identify SCM Directories: Check your ozone-site.xml to confirm which 
> Ozone directories are on the failed disk. The most important properties are:
>        * ozone.scm.db.dirs: The primary SCM metadata database.
>        * ozone.scm.ha.ratis.storage.dir: The location for SCM's internal HA 
> Ratis logs (in an HA setup).
>    3. Prepare the Replacement Disk: Physically install a new, healthy disk. 
> Format it and mount it at the same path as the failed disk. Ensure it has the
>       correct ownership and permissions for the user that runs the SCM 
> process.
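>
>   A sketch of these checks as shell commands. The device name /dev/sdX, the 
>   mount point /data/scm, the ext4 filesystem, the ozone service user, and the 
>   OZONE_CONF_DIR location are assumptions to adapt to your environment:
>
>       # 1. Confirm which device failed and where it was mounted
>       $ dmesg | grep -i 'error'
>       $ smartctl -H /dev/sdX
>
>       # 2. Confirm which SCM directories live on that disk
>       $ grep -E -A1 'ozone.scm.db.dirs|ozone.scm.ha.ratis.storage.dir' \
>           "$OZONE_CONF_DIR/ozone-site.xml"
>
>       # 3. Prepare the replacement disk at the same mount point
>       $ mkfs.ext4 /dev/sdX1
>       $ mount /dev/sdX1 /data/scm
>       $ chown -R ozone:ozone /data/scm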
>   ---
>   3. Procedure for a Standalone (Non-HA) SCM
>   This procedure is a critical disaster recovery event that requires full 
> cluster downtime and a valid backup.
>    1. STOP THE ENTIRE CLUSTER: Shut down all clients, DataNodes, OMs, and the 
> SCM. Without a functional SCM, DataNodes cannot heartbeat and new block
>       allocations will fail.
>    2. Attempt Data Recovery: If possible, make a best-effort attempt to copy 
> the contents of the ozone.scm.db.dirs directory from the failing disk to a
>       safe, temporary location.
>    3. If Recovery Fails, Restore from Backup: If the SCM database is 
> unrecoverable, you must restore it from your most recent backup. Without a 
> backup, you
>       risk permanent data loss or a lengthy, complex, and potentially 
> incomplete state reconstruction from DataNode reports.
>    4. Replace and Configure Disk: Physically replace the hardware and ensure 
> the new, empty disk is mounted at the correct path defined in
>       ozone.scm.db.dirs.
>    5. Restore Metadata: Copy the recovered data (from Step 2) or the restored 
> backup data (from Step 3) to the ozone.scm.db.dirs path on the new disk.
>    6. Restart and Verify:
>        * Start the SCM service first.
>        * Once the SCM is fully initialized and running, start the OMs and 
> then the DataNodes.
>        * Check the SCM Web UI to confirm that DataNodes are heartbeating and 
> that pipelines are healthy. Run client I/O tests to ensure the cluster is 
> fully
>          operational.
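>
>   A hedged sketch of the copy and restore in steps 2 and 5, assuming 
>   ozone.scm.db.dirs points at /data/scm/db (a placeholder path), the failing 
>   disk is still partially readable, a staging area exists at 
>   /backup/scm-recovery, and the SCM runs as the ozone user:
>
>       # Step 2: best-effort copy off the failing disk (SCM already stopped)
>       $ rsync -a /data/scm/db/ /backup/scm-recovery/scm-db/
>
>       # Step 5: after the new disk is mounted at /data/scm, restore the metadata
>       $ rsync -a /backup/scm-recovery/scm-db/ /data/scm/db/
>       $ chown -R ozone:ozone /data/scm/db
>
>       # Step 6: start the SCM first, then the OMs and DataNodes
>       $ ozone --daemon start scm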
>   ---
>   4. Procedure for an HA (Ratis-based) SCM
>   This is the recommended production procedure. It leverages the HA quorum 
> for recovery, requires no cluster downtime, and is much safer.
>    1. STOP THE FAILED SCM INSTANCE: On the node with the failed disk, stop 
>       only the SCM process. The remaining SCMs continue to operate; if the 
>       failed node was the leader, the quorum elects a new one and keeps 
>       managing the cluster.
>    2. Replace and Configure Disk: Physically replace the hardware. Mount the 
> new, empty disk at the path(s) defined in ozone.scm.db.dirs and
>       ozone.scm.ha.ratis.storage.dir. Ensure correct ownership and 
> permissions.
>    3. RE-INITIALIZE THE SCM VIA BOOTSTRAP: The failed SCM has lost its state 
> and must rejoin the HA cluster by getting a full copy of the latest state from
>       the current leader. This is done using the scm --bootstrap command.
>    4. RUN BOOTSTRAP AND MONITOR:
>        * On the repaired node, execute the bootstrap command: bin/ozone scm 
> --bootstrap
>        * This command will:
>            1. Connect to the existing SCM HA ring and register this node with 
>               the current leader.
>            2. Obtain the cluster metadata (such as the cluster ID) this node 
>               needs to rejoin the ring.
>        * Then start the SCM daemon. It joins the Ratis ring as a follower and 
>          catches up by installing the latest database checkpoint (a snapshot) 
>          from the leader onto the new disk.
>        * Monitor the console output of the bootstrap command and the SCM's 
> log file (.log and .out). You will see messages related to downloading the
>          snapshot and joining the ring.
>    5. VERIFY:
>        * Once the bootstrap is complete and the daemon is running, the SCM is 
> a healthy follower in the quorum.
>        * Check the SCM Web UI from any of the SCM nodes. The list of peers 
> should now show all SCMs as healthy. The cluster is back at full redundancy.
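>
>   In addition to the Web UI, the quorum state can be checked from the command 
>   line. This sketch assumes the ozone admin scm roles subcommand is available 
>   in your release (verify against your version's CLI help):
>
>       # On any SCM node, list the SCM HA peers and their Ratis roles
>       $ ozone admin scm roles
>       # Expect one LEADER plus the other SCMs, including the freshly 
>       # bootstrapped node, reported as FOLLOWER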
>   ---
>   5. Additional Considerations
>    * Primordial SCM Node: In an HA setup, the SCM initialized with scm --init 
>      is the "primordial" node, which generates the cluster's unique ID. If the 
>      primordial node's disk fails, the recovery procedure is the same (scm 
>      --bootstrap): the cluster ID is preserved by the surviving SCMs and is 
>      replicated to the repaired node during the bootstrap process.
>    * Backups are Still Essential: Even in a robust HA configuration, 
> maintaining regular, off-site backups of the SCM database is a critical best 
> practice
>      for recovering from catastrophic multi-node failures or logical data 
> corruption.
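>
>   As an illustration of that practice, a minimal cold-backup sketch. It 
>   assumes the copy is taken while the SCM is stopped (or from a follower 
>   during a maintenance window) so the RocksDB files are not changing 
>   underneath it; the paths and the remote backup host are placeholders:
>
>       # Value of ozone.scm.db.dirs on this node (placeholder)
>       $ SCM_DB_DIR=/data/scm/db
>       # Archive the SCM metadata directory with a dated name
>       $ tar -czf /backup/scm-db-$(date +%F).tar.gz -C "$SCM_DB_DIR" .
>       # Ship the archive off the node
>       $ scp /backup/scm-db-$(date +%F).tar.gz backup-host:/srv/ozone-backups/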



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
