Wei-Chiu Chuang created HDDS-14503:
--------------------------------------

             Summary: [Website v2] [Docs] [Administrator Guide] Replacing 
Storage Container Manager Disks
                 Key: HDDS-14503
                 URL: https://issues.apache.org/jira/browse/HDDS-14503
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: documentation
            Reporter: Wei-Chiu Chuang


[https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/storage-container-manager]

 

If the disk containing the SCM metadata directory (ozone.scm.db.dirs) needs to
be replaced for any reason, the SCM metadata directory will need to be
reconstructed by running ozone scm --bootstrap (assuming SCM HA is configured).
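
For illustration, a minimal sketch of that recovery on the affected SCM host,
assuming SCM HA is enabled and the replacement disk is already mounted at the
ozone.scm.db.dirs location:

    # on the SCM node whose metadata disk was replaced (SCM process stopped)
    ozone scm --bootstrap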

 

---

Gemini CLI suggests the following content write-up:

 

  ---

  Title: Replacing Storage Container Manager (SCM) Disks

  Audience: Cluster Administrators
  Prerequisites: Familiarity with Ozone cluster administration, especially SCM 
and its HA configuration.

  ---

  1. Overview

   * Purpose: This guide details the procedure for replacing a failed disk on 
an SCM node.
   * Impact of SCM Disk Failure: The SCM disk is critical, as it stores the
     RocksDB database containing the state of the entire cluster's physical
     storage, including:
       * DataNode registration and heartbeat status.
       * Pipeline information and states.
       * Container locations and replica information.
     A failure of this disk without a proper recovery plan can render the
     cluster unable to manage storage or allocate new blocks.
   * Crucial Distinction: HA vs. Non-HA: The procedure depends entirely on
     whether your SCM is a single, standalone instance or part of a
     High-Availability (HA) Ratis-based quorum. Running a standalone SCM is a
     single point of failure and is not recommended for production
     environments. A quick way to confirm which mode is in use is sketched
     after this list.
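
  As a quick, illustrative check of which mode a cluster is running in (the
  ozone-site.xml path below is an assumption; adjust it to your installation):

      # list SCM instances and their Ratis roles; an HA deployment shows one
      # LEADER plus FOLLOWERs, a standalone deployment shows a single SCM
      ozone admin scm roles

      # the SCM HA switch in the configuration
      grep -A1 'ozone.scm.ratis.enable' /etc/hadoop/conf/ozone-site.xml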

  ---

  2. Pre-flight Checks

   1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to 
confirm which disk has failed and its mount point.
   2. Identify SCM Directories: Check your ozone-site.xml to confirm which 
Ozone directories are on the failed disk. The most important properties are:
       * ozone.scm.db.dirs: The primary SCM metadata database.
       * ozone.scm.ha.ratis.storage.dir: The location for SCM's internal HA 
Ratis logs (in an HA setup).
   3. Prepare the Replacement Disk: Physically install a new, healthy disk.
      Format it and mount it at the same path as the failed disk. Ensure it
      has the correct ownership and permissions for the user that runs the SCM
      process. Example commands for these checks are sketched after this list.
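
  The checks above might look roughly like the following on a typical Linux
  host; the device name, mount point, service user, and ozone-site.xml path
  are placeholders to adapt:

      # 1. confirm the failing device and its mount point
      dmesg -T | grep -i -E 'sdb|i/o error'
      smartctl -H /dev/sdb
      lsblk -o NAME,MOUNTPOINT,SIZE /dev/sdb

      # 2. confirm which Ozone directories live on that disk
      grep -A1 -E 'ozone.scm.db.dirs|ozone.scm.ha.ratis.storage.dir' \
          /etc/hadoop/conf/ozone-site.xml
      # (if ozone.scm.db.dirs is unset, SCM typically falls back to
      #  ozone.metadata.dirs)

      # 3. format the replacement disk, mount it at the original path, and
      #    restore ownership for the user that runs SCM (assumed: ozone)
      mkfs.xfs /dev/sdb1
      mount /dev/sdb1 /data/scm
      mkdir -p /data/scm/db /data/scm/ratis
      chown -R ozone:ozone /data/scm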

  ---

  3. Procedure for a Standalone (Non-HA) SCM

  This procedure is a critical disaster recovery event that requires full
  cluster downtime and a valid backup. An illustrative command sequence is
  sketched after the steps below.

   1. STOP THE ENTIRE CLUSTER: Shut down all clients, DataNodes, OMs, and the 
SCM. Without a functional SCM, DataNodes cannot heartbeat and new block
      allocations will fail.
   2. Attempt Data Recovery: If possible, make a best-effort attempt to copy 
the contents of the ozone.scm.db.dirs directory from the failing disk to a
      safe, temporary location.
   3. If Recovery Fails, Restore from Backup: If the SCM database is 
unrecoverable, you must restore it from your most recent backup. Without a 
backup, you
      risk permanent data loss or a lengthy, complex, and potentially 
incomplete state reconstruction from DataNode reports.
   4. Replace and Configure Disk: Physically replace the hardware and ensure 
the new, empty disk is mounted at the correct path defined in
      ozone.scm.db.dirs.
   5. Restore Metadata: Copy the recovered data (from Step 2) or the restored 
backup data (from Step 3) to the ozone.scm.db.dirs path on the new disk.
   6. Restart and Verify:
       * Start the SCM service first.
       * Once the SCM is fully initialized and running, start the OMs and then 
the DataNodes.
       * Check the SCM Web UI to confirm that DataNodes are heartbeating and 
that pipelines are healthy. Run client I/O tests to ensure the cluster is fully
         operational.
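
  An illustrative command sequence for the steps above, assuming the stock
  ozone --daemon wrapper (systemd or Kubernetes deployments will differ), the
  SCM DB under /data/scm/db, and a backup tarball at /backup/scm-db.tar.gz:

      # step 1: stop services (run on the respective hosts)
      ozone --daemon stop datanode
      ozone --daemon stop om
      ozone --daemon stop scm

      # step 2: best-effort salvage of the old DB, if the disk is still readable
      rsync -a /data/scm/db/ /tmp/scm-db-salvage/ || true

      # steps 4-5: after mounting the new disk at /data/scm, restore the
      # salvaged copy or the most recent backup into ozone.scm.db.dirs
      rsync -a /tmp/scm-db-salvage/ /data/scm/db/
      # ...or: tar -xzf /backup/scm-db.tar.gz -C /data/scm/db
      chown -R ozone:ozone /data/scm/db

      # step 6: restart in order (SCM, then OM, then DataNodes) and verify
      ozone --daemon start scm
      ozone --daemon start om
      ozone --daemon start datanode
      ozone admin datanode list      # DataNodes should re-register and heartbeat
      ozone admin pipeline list      # pipelines should come back to OPEN

      # simple client I/O smoke test
      ozone sh volume create /vol-smoketest
      ozone sh bucket create /vol-smoketest/bucket1
      ozone sh key put /vol-smoketest/bucket1/key1 /etc/hosts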

  ---

  4. Procedure for an HA (Ratis-based) SCM

  This is the recommended production procedure. It leverages the HA quorum for
  recovery, requires no cluster-wide downtime, and is much safer. An
  illustrative command sequence is sketched after the steps below.

   1. STOP THE FAILED SCM INSTANCE: On the node with the failed disk, stop only 
the SCM process. The other SCMs will continue to operate, and one of them
      will remain the leader, managing the cluster.
   2. Replace and Configure Disk: Physically replace the hardware. Mount the 
new, empty disk at the path(s) defined in ozone.scm.db.dirs and
      ozone.scm.ha.ratis.storage.dir. Ensure correct ownership and permissions.
   3. RE-INITIALIZE THE SCM VIA BOOTSTRAP: The failed SCM has lost its state 
and must rejoin the HA cluster by getting a full copy of the latest state from
      the current leader. This is done using the scm --bootstrap command.
   4. RUN BOOTSTRAP AND MONITOR:
       * On the repaired node, execute the bootstrap command: bin/ozone scm 
--bootstrap
       * This command will:
           1. Connect to the existing SCM HA ring.
           2. Trigger the current leader to create a database checkpoint (a 
snapshot).
           3. Securely download the snapshot and install it locally on the new 
disk.
           4. Start the SCM daemon, which will join the Ratis ring as a 
follower.
       * Monitor the console output of the bootstrap command and the SCM's log 
file (.log and .out). You will see messages related to downloading the
         snapshot and joining the ring.
   5. VERIFY:
       * Once the bootstrap is complete and the daemon is running, the SCM is a 
healthy follower in the quorum.
       * Check the SCM Web UI from any of the SCM nodes. The list of peers 
should now show all SCMs as healthy. The cluster is back at full redundancy.
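
  A sketch of the same flow as commands, run on the repaired SCM node; the
  mount point, service user, log location, and daemon wrapper below are
  assumptions to adapt to your environment:

      # step 1: stop only the SCM on the node with the failed disk
      ozone --daemon stop scm

      # step 2: mount the replacement disk and fix ownership
      mount /dev/sdb1 /data/scm
      mkdir -p /data/scm/db /data/scm/ratis
      chown -R ozone:ozone /data/scm

      # steps 3-4: bootstrap against the existing HA ring; depending on the
      # Ozone version the bootstrap step may leave the daemon stopped, in
      # which case start it explicitly afterwards
      ozone scm --bootstrap
      ozone --daemon start scm
      tail -f /var/log/ozone/*scm*.log   # watch for snapshot download / ring join

      # step 5: verify from any SCM node that the quorum is healthy again
      ozone admin scm roles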

  ---

  5. Additional Considerations

   * Primordial SCM Node: In an HA setup, the first SCM started with scm --init 
is the "primordial" node, which generates the cluster's unique ID. If the
     primordial node's disk fails, the recovery procedure is the same (scm 
--bootstrap). The cluster ID is preserved by the surviving SCMs and will be
     replicated to the repaired node during the bootstrap process.
   * Backups are Still Essential: Even in a robust HA configuration,
     maintaining regular, off-site backups of the SCM database is a critical
     best practice for recovering from catastrophic multi-node failures or
     logical data corruption. A minimal backup sketch follows this list.
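
  As an illustration only (not an official backup tool), a simple
  filesystem-level copy of the SCM DB can serve as such a backup, taken from
  an SCM that has been stopped briefly for the purpose (a follower, in an HA
  quorum); paths and hostnames are assumptions:

      # on the briefly stopped SCM
      ozone --daemon stop scm
      tar -czf /backup/scm-db-$(date +%F).tar.gz -C /data/scm db
      ozone --daemon start scm

      # ship the archive off the node
      rsync -a /backup/scm-db-*.tar.gz backup-host:/backups/ozone/scm/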


