[
https://issues.apache.org/jira/browse/HDDS-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-14503:
----------------------------------
Labels: pull-request-available (was: )
> [Website v2] [Docs] [Administrator Guide] Replacing Storage Container Manager Disks
> -----------------------------------------------------------------------------------
>
> Key: HDDS-14503
> URL: https://issues.apache.org/jira/browse/HDDS-14503
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: documentation
> Reporter: Wei-Chiu Chuang
> Assignee: Gargi Jaiswal
> Priority: Major
> Labels: pull-request-available
>
> [https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/storage-container-manager]
>
> If the disk containing the SCM metadata directory (ozone.scm.db.dirs) needs to
> be replaced for whatever reason, the SCM metadata directory will need to be
> reconstructed by running ozone scm --bootstrap (assuming SCM HA is configured).
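>
> For the HA case, a minimal sketch of that recovery on the affected SCM node
> (assuming a tarball-style layout where daemons are managed with
> bin/ozone --daemon; the mount point and paths are placeholders) is:
>
>     # with the replacement disk mounted at the ozone.scm.db.dirs path
>     bin/ozone scm --bootstrap       # register this SCM with the existing HA ring
>     bin/ozone --daemon start scm    # rejoin the quorum; state is replicated from the leader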
>
> ---
> Gemini CLI suggests the following content writeup:
>
> ---
> Title: Replacing Storage Container Manager (SCM) Disks
> Audience: Cluster Administrators
> Prerequisites: Familiarity with Ozone cluster administration, especially
> SCM and its HA configuration.
> ---
> 1. Overview
> * Purpose: This guide details the procedure for replacing a failed disk on
> an SCM node.
> * Impact of SCM Disk Failure: The SCM disk is critical, as it stores the
> RocksDB database containing the state of the entire cluster's physical
> storage, including:
> * DataNode registration and heartbeat status.
> * Pipeline information and states.
> * Container locations and replica information.
> * A failure of this disk without a proper recovery plan can render the
> cluster unable to manage storage or allocate new blocks.
> * Crucial Distinction: HA vs. Non-HA: The procedure depends entirely on
> whether your SCM is a single, standalone instance or part of a
> High-Availability (HA) Ratis-based quorum. Running a standalone SCM is a
> single point of failure and is not recommended for production environments.
> ---
> 2. Pre-flight Checks
> 1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to
> confirm which disk has failed and its mount point.
> 2. Identify SCM Directories: Check your ozone-site.xml to confirm which
> Ozone directories are on the failed disk. The most important properties are:
> * ozone.scm.db.dirs: The primary SCM metadata database.
> * ozone.scm.ha.ratis.storage.dir: The location for SCM's internal HA
> Ratis logs (in an HA setup).
> 3. Prepare the Replacement Disk: Physically install a new, healthy disk.
> Format it and mount it at the same path as the failed disk. Ensure it has
> the correct ownership and permissions for the user that runs the SCM
> process. (Example commands for these pre-flight steps are sketched after
> this list.)
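>
> A shell sketch of these pre-flight checks on the SCM node (the device name
> /dev/sdX, the mount point /data/scm, the config path, and the ozone service
> user are illustrative placeholders, not values defined by this guide):
>
>     # 1. confirm which device is failing
>     dmesg | grep -i error
>     smartctl -H /dev/sdX
>
>     # 2. show the configured SCM metadata and Ratis directories
>     #    (compare them against the failed mount point)
>     grep -A1 'ozone.scm.db.dirs\|ozone.scm.ha.ratis.storage.dir' \
>         /etc/ozone/conf/ozone-site.xml
>
>     # 3. prepare the replacement disk at the same mount point
>     mkfs.xfs /dev/sdX
>     mount /dev/sdX /data/scm
>     chown -R ozone:ozone /data/scm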
> ---
> 3. Procedure for a Standalone (Non-HA) SCM
> This procedure is a critical disaster recovery event that requires full
> cluster downtime and a valid backup.
> 1. STOP THE ENTIRE CLUSTER: Shut down all clients, DataNodes, OMs, and the
> SCM. Without a functional SCM, DataNodes cannot heartbeat and new block
> allocations will fail.
> 2. Attempt Data Recovery: If possible, make a best-effort attempt to copy
> the contents of the ozone.scm.db.dirs directory from the failing disk to a
> safe, temporary location.
> 3. If Recovery Fails, Restore from Backup: If the SCM database is
> unrecoverable, you must restore it from your most recent backup. Without a
> backup, you risk permanent data loss or a lengthy, complex, and potentially
> incomplete state reconstruction from DataNode reports.
> 4. Replace and Configure Disk: Physically replace the hardware and ensure
> the new, empty disk is mounted at the correct path defined in
> ozone.scm.db.dirs.
> 5. Restore Metadata: Copy the recovered data (from Step 2) or the restored
> backup data (from Step 3) to the ozone.scm.db.dirs path on the new disk.
> 6. Restart and Verify:
> * Start the SCM service first.
> * Once the SCM is fully initialized and running, start the OMs and
> then the DataNodes.
> * Check the SCM Web UI to confirm that DataNodes are heartbeating and
> that pipelines are healthy. Run client I/O tests to ensure the cluster is
> fully operational. (A condensed command sketch of this procedure follows
> this list.)
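>
> A condensed sketch of the standalone recovery above (all paths and the
> backup location are placeholders; daemon start/stop syntax assumes the
> bin/ozone --daemon launcher):
>
>     # on the SCM node, with the rest of the cluster already stopped (Step 1)
>     bin/ozone --daemon stop scm
>
>     # best-effort salvage of the old metadata directory (Step 2)
>     mkdir -p /backup/scm-db-salvage
>     cp -a /data/scm/. /backup/scm-db-salvage/ 2>/dev/null || true
>
>     # with the new disk mounted at the same path, restore the salvaged
>     # copy or the most recent backup (Steps 4-5)
>     cp -a /backup/scm-db-latest/. /data/scm/
>
>     # restart order: SCM first, then OMs, then DataNodes (Step 6)
>     bin/ozone --daemon start scm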
> ---
> 4. Procedure for an HA (Ratis-based) SCM
> This is the recommended production procedure. It leverages the HA quorum
> for recovery, requires no cluster downtime, and is much safer.
> 1. STOP THE FAILED SCM INSTANCE: On the node with the failed disk, stop
> only the SCM process. The other SCMs will continue to operate, and one of them
> will remain the leader, managing the cluster.
> 2. Replace and Configure Disk: Physically replace the hardware. Mount the
> new, empty disk at the path(s) defined in ozone.scm.db.dirs and
> ozone.scm.ha.ratis.storage.dir. Ensure correct ownership and permissions.
> 3. RE-INITIALIZE THE SCM VIA BOOTSTRAP: The failed SCM has lost its state
> and must rejoin the HA cluster by getting a full copy of the latest state from
> the current leader. This is done using the scm --bootstrap command.
> 4. RUN BOOTSTRAP AND MONITOR:
> * On the repaired node, execute the bootstrap command: bin/ozone scm
> --bootstrap
> * Bootstrapping and then starting the SCM daemon will:
> 1. Register the repaired SCM with the existing SCM HA ring.
> 2. Trigger the current leader to provide a database checkpoint (a
> snapshot).
> 3. Securely download the snapshot and install it locally on the
> new disk.
> 4. Join the Ratis ring as a follower and catch up on any remaining
> log entries.
> * Monitor the console output of the bootstrap command and the SCM's
> log file (.log and .out). You will see messages related to downloading the
> snapshot and joining the ring.
> 5. VERIFY:
> * Once the bootstrap is complete and the daemon is running, the SCM is
> a healthy follower in the quorum.
> * Check the SCM Web UI from any of the SCM nodes. The list of peers
> should now show all SCMs as healthy. The cluster is back at full redundancy.
> (A condensed command sketch of this procedure follows this list.)
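>
> A condensed command sketch of the HA procedure above, run on the repaired
> node (the device, mount point, service user, and log path are placeholders;
> daemon management assumes the bin/ozone --daemon launcher):
>
>     bin/ozone --daemon stop scm        # step 1: stop only this SCM instance
>
>     # step 2: mount the replacement disk and fix ownership
>     mount /dev/sdX /data/scm
>     chown -R ozone:ozone /data/scm
>
>     bin/ozone scm --bootstrap          # step 3: re-register with the HA ring
>     bin/ozone --daemon start scm       # step 4: join as a follower; the leader replicates state
>
>     tail -f /var/log/ozone/*scm*.log   # watch for snapshot download / ring join messages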
> ---
> 5. Additional Considerations
> * Primordial SCM Node: In an HA setup, the first SCM started with
> scm --init is the "primordial" node, which generates the cluster's unique
> ID. If the primordial node's disk fails, the recovery procedure is the same
> (scm --bootstrap). The cluster ID is preserved by the surviving SCMs and
> will be replicated to the repaired node during the bootstrap process.
> * Backups are Still Essential: Even in a robust HA configuration,
> maintaining regular, off-site backups of the SCM database is a critical
> best practice for recovering from catastrophic multi-node failures or
> logical data corruption. (A minimal backup sketch follows.)
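>
> One possible illustration of such a backup: a file-level copy taken while
> the SCM process on that node is stopped (paths are placeholders; an online,
> checkpoint-based backup may be preferable where available):
>
>     bin/ozone --daemon stop scm
>     tar czf /backup/scm-db-$(date +%F).tar.gz -C /data/scm .
>     bin/ozone --daemon start scm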