jojochuang commented on code in PR #287:
URL: https://github.com/apache/ozone-site/pull/287#discussion_r2730480845


##########
docs/05-administrator-guide/03-operations/04-disk-replacement/02-storage-container-manager.md:
##########
@@ -4,4 +4,143 @@ sidebar_label: Storage Container Manager
 
 # Replacing Storage Container Manager Disks
 
-**TODO:** File a subtask under [HDDS-9859](https://issues.apache.org/jira/browse/HDDS-9859) and complete this page or section.
+**Audience:** Cluster Administrators
+
+**Prerequisites:** Familiarity with Ozone cluster administration, especially SCM and its HA configuration.
+
+---
+
+## 1. Overview
+
+When a disk containing the Storage Container Manager (SCM) metadata directory fails, proper recovery procedures are critical to maintain cluster availability and prevent data loss.
+
+If the disk containing SCM metadata directory (`ozone.scm.db.dirs`) needs to be replaced for whatever reason, the SCM metadata directory will need to be reconstructed by running `ozone scm --bootstrap` (assuming SCM HA is configured).
+
+- **Purpose :**
+This guide details the procedure for replacing a failed disk on an SCM node.
+
+- **Impact of SCM Disk Failure :**
+The SCM disk is critical, as it stores the RocksDB database containing the state of the entire cluster's physical storage, including:
+  - Datanode registration and heartbeat status.
+  - Pipeline information and states.
+  - Container locations and replica information.
+  - A failure of this disk without a proper recovery plan can render the cluster unable to manage storage or allocate new blocks.
+
+- **Crucial Distinction: HA vs. Non-HA :**
+The procedure depends entirely on whether your SCM is a single, standalone instance or part of a High-Availability (HA) Ratis-based quorum. Running a standalone SCM is a single point of failure and is not recommended for production environments.
+
+---
+
+## 2. Pre-flight Checks
+
+Before starting, the administrator should:
+
+1. **Identify the Failed Disk:** Use system tools (`dmesg`, `smartctl`, etc.) to confirm which disk has failed and its mount point.
+
+2. **Identify SCM Directories:** Check your `ozone-site.xml` to confirm which Ozone directories are on the failed disk. The most important properties are:
+   - `ozone.scm.db.dirs`: The primary SCM metadata database (RocksDB). This directory stores the entire cluster's block management metadata.
+   - `ozone.scm.ha.ratis.storage.dir`: The location for SCM's internal HA Ratis logs (in an HA setup). This directory stores Ratis metadata like logs. If not explicitly configured, it falls back to `ozone.metadata.dirs`. For production environments, it is recommended to configure this on a separate, fast disk (preferably SSD) for better performance.
+   - `ozone.scm.ha.ratis.snapshot.dir`: The directory where SCM stores snapshot tarballs downloaded from the leader during recovery. If not explicitly configured, it defaults to a component-specific location under `ozone.metadata.dirs`.
+
+3. **Prepare the Replacement Disk:** Physically install a new, healthy disk. Format it and mount it at the same path as the failed disk. Ensure it has the correct ownership and permissions for the user that runs the SCM process. The default permissions for SCM metadata directories are **750** (configurable via `ozone.scm.db.dirs.permissions`).
+
+---
+
+## 3. Procedure for a Standalone (Non-HA) SCM
+
+This procedure is a critical disaster recovery event that requires full cluster downtime and a valid backup.
+
+1. **STOP THE ENTIRE CLUSTER:** Shut down all clients, Datanodes, OMs, and the SCM. Without a functional SCM, Datanodes cannot heartbeat and new block allocations will fail.
+
+2. **Attempt Data Recovery:** If possible, make a best-effort attempt to copy the contents of the `ozone.scm.db.dirs` directory from the failing disk to a safe, temporary location.
+
+3. **If Recovery Fails, Restore from Backup:** If the SCM database is unrecoverable, you must restore it from your most recent backup. Without a backup, you risk permanent data loss or a lengthy, complex, and potentially incomplete state reconstruction from Datanode reports.

Review Comment:
   If the cluster runs Recon, Recon keeps a copy of the SCM snapshot, which
   may be usable here.
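
   For illustration only, the best-effort copy described in step 2 could be sketched roughly as below. This is a minimal sketch, not a tested procedure: the function name `best_effort_copy` and the example paths are hypothetical, and the source directory is assumed to be whatever `ozone.scm.db.dirs` (or, where applicable, the directory holding Recon's SCM snapshot copy) resolves to on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: best-effort copy of a metadata directory off a failing disk.
# Arguments:
#   $1 - source directory (assumed: the value of ozone.scm.db.dirs)
#   $2 - destination directory on a healthy disk
best_effort_copy() {
  local src="$1" dest="$2"
  mkdir -p "$dest" || return 1
  # cp -a preserves ownership, modes, and timestamps. When one file cannot be
  # read, cp reports the error and continues with the remaining files.
  cp -a "$src"/. "$dest"/ 2>>"${dest}.copy-errors.log"
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "copy finished with errors; see ${dest}.copy-errors.log" >&2
  fi
  return "$rc"
}

# Example invocation (paths are placeholders, not Ozone defaults):
#   best_effort_copy /data/failing-disk/scm/db /mnt/rescue/scm-db-copy
```

   A non-zero return code here means at least one file could not be copied, so the copy should be treated as suspect and validated before it is used in place of a backup.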



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

