This is an automated email from the ASF dual-hosted git repository.

weichiu pushed a commit to branch HDDS-9225-website-v2
in repository https://gitbox.apache.org/repos/asf/ozone-site.git


The following commit(s) were added to refs/heads/HDDS-9225-website-v2 by this 
push:
     new 83d681634 HDDS-14504. [Website v2] [Docs] [Administrator Guide] 
Replacing Recon Disks (#288)
83d681634 is described below

commit 83d68163472b4204e5220efcaa853edb1fc91b70
Author: Gargi Jaiswal <[email protected]>
AuthorDate: Wed Jan 28 23:16:04 2026 +0530

    HDDS-14504. [Website v2] [Docs] [Administrator Guide] Replacing Recon Disks 
(#288)
    
    Co-authored-by: Wei-Chiu Chuang <[email protected]>
---
 cspell.yaml                                        |   1 +
 .../03-operations/04-disk-replacement/04-recon.md  | 126 ++++++++++++++++++++-
 2 files changed, 126 insertions(+), 1 deletion(-)

diff --git a/cspell.yaml b/cspell.yaml
index efb4a35e9..15b22642e 100644
--- a/cspell.yaml
+++ b/cspell.yaml
@@ -75,6 +75,7 @@ words:
 - ratis
 - OM
 - SCM
+- Recon
 - datanode
 - datanodes
 - DN
diff --git a/docs/05-administrator-guide/03-operations/04-disk-replacement/04-recon.md b/docs/05-administrator-guide/03-operations/04-disk-replacement/04-recon.md
index 9597162c0..bc162ac24 100644
--- a/docs/05-administrator-guide/03-operations/04-disk-replacement/04-recon.md
+++ b/docs/05-administrator-guide/03-operations/04-disk-replacement/04-recon.md
@@ -4,4 +4,128 @@ sidebar_label: Recon
 
 # Replacing Recon Disks
 
-**TODO:** File a subtask under [HDDS-9859](https://issues.apache.org/jira/browse/HDDS-9859) and complete this page or section.
+**Audience:** Cluster Administrators
+
+**Prerequisites:** Familiarity with Ozone services and Linux system 
administration.
+
+---
+
+## 1. Overview
+
+- **Purpose:**
+  This guide describes the procedure for replacing a failed disk on an Ozone Recon node.
+
+- **Impact of Recon Disk Failure:**
+  A Recon disk failure stops the Recon service. However, because Recon is not in the critical path for data I/O, it has **no impact on the core operations of your Ozone cluster**: client reads and writes continue normally, and all data on the Recon disk can be fully rebuilt from OM and SCM.
+
+:::note
+Because everything stored on the Recon disk can be rebuilt from the active OM and SCM services, disk replacement is a straightforward, low-risk procedure that can be performed without cluster downtime.
+:::
+
+When a Recon disk fails, the service stops functioning. On restart with empty directories, Recon automatically detects the missing databases and initiates a complete rebuild by downloading fresh snapshots from the OM leader and syncing with SCM. This automatic recovery rebuilds all Recon databases (the OM snapshot database, the SCM snapshot database if enabled, and Recon's own aggregated analysis databases) without manual intervention.
+
+### Recon Database Directories
+
+Recon uses several database directories that may be affected by disk failure:
+
+- **`ozone.recon.db.dir`**: Stores Recon's primary RocksDB database, which 
contains aggregated data and analysis results (ContainerKey and 
ContainerKeyCount tables). This directory also typically contains the SQL 
database (default Derby) used for storing GlobalStats, FileCountBySize, 
ReconTaskStatus, ContainerHistory, and UnhealthyContainers tables.
+- **`ozone.recon.om.db.dir`**: Stores the copy of the OM database snapshot 
that Recon uses as its source of truth for the namespace.
+- **`ozone.recon.scm.db.dirs`**: Stores the copy of the SCM database snapshot 
(if SCM snapshot is enabled via `ozone.recon.scm.snapshot.enabled`). This 
contains information about Datanodes, pipelines, and containers.
+
+If any of these directories are on the failed disk, they will need to be recreated on the replacement disk; Recon rebuilds their contents automatically on restart.
+
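+As a quick check, the following sketch shows one way to confirm where these properties point. It assumes `ozone-site.xml` lives at `/etc/ozone/conf/ozone-site.xml`; adjust the path for your deployment, and note that properties left at their defaults may not appear in the file at all.
+
+```shell
+# Hypothetical config location; adjust for your deployment.
+OZONE_SITE=/etc/ozone/conf/ozone-site.xml
+
+# Print each Recon directory property together with its configured value.
+grep -E -A1 '<name>ozone\.recon\.(db\.dir|om\.db\.dir|scm\.db\.dirs)</name>' "$OZONE_SITE"
+```
+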
+---
+
+## 2. Pre-flight Checks
+
+Before starting, the administrator should:
+
+1. **Identify the Failed Disk:** Use system tools (`dmesg`, `smartctl`, etc.) to confirm which disk has failed and which mount point it backs (see the sketch after this list).
+
+2. **Identify Recon Directories:** Check your `ozone-site.xml` to confirm 
which Recon directories are on the failed disk. The primary properties are:
+   - `ozone.recon.db.dir`: Stores Recon's primary RocksDB database, which 
contains aggregated data and analysis results.
+   - `ozone.recon.om.db.dir`: Stores the copy of the OM database snapshot that 
Recon uses as its source of truth for the namespace.
+   - `ozone.recon.scm.db.dirs`: Stores the copy of the SCM database snapshot 
(if SCM snapshot is enabled).
+   - `ozone.recon.sql.db.jdbc.url`: The JDBC URL for the SQL database 
(defaults to `jdbc:derby:${ozone.recon.db.dir}/ozone_recon_derby.db` if not 
explicitly configured).
+
+3. **Prepare the Replacement Disk:** Physically install a new, healthy disk. 
Format it and mount it at the same path as the failed disk. Ensure it has the 
correct ownership and permissions for the user that runs the Recon process. The 
default permissions for Recon metadata directories are **750** (configurable 
via `ozone.recon.db.dirs.permissions`).
+
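+A minimal identification sketch follows; the device name `/dev/sdb` and mount point `/data/recon` are placeholders for your actual hardware and configuration.
+
+```shell
+# Scan kernel logs for recent I/O errors on the suspect device (placeholder name).
+dmesg | grep -iE 'error|fail' | grep sdb
+
+# Query the disk's SMART health status.
+sudo smartctl -H /dev/sdb
+
+# Confirm which filesystem and mount point the disk backs.
+lsblk /dev/sdb
+df -h /data/recon
+```
+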
+---
+
+## 3. Procedure for Replacing a Recon Disk
+
+This is a low-risk recovery procedure that can be performed without any 
downtime for your main Ozone cluster.
+
+1. **STOP THE Recon SERVICE:** On the Recon node, stop the Recon daemon. The 
rest of your Ozone cluster remains fully operational.
+
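+   How you stop the daemon depends on how the cluster is managed (systemd, a cluster manager, etc.). A sketch assuming the `bin/ozone` launcher is used directly:
+
+   ```shell
+   # On the Recon node: stop the Recon daemon.
+   ozone --daemon stop recon
+   ```
+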
+2. **Replace and Configure Disk:** Physically replace the hardware. Mount the 
new, empty disk at the path(s) defined in `ozone.recon.db.dir`, 
`ozone.recon.om.db.dir`, and `ozone.recon.scm.db.dirs` (if configured). Ensure 
the directories exist and have the correct ownership and permissions for the 
user that runs the Recon process.
+
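+   A sketch of the disk preparation, assuming a single placeholder mount point `/data/recon` hosts all three directories and the service runs as a placeholder user `ozone`:
+
+   ```shell
+   # Format and mount the replacement disk (device and paths are placeholders).
+   sudo mkfs.ext4 /dev/sdb1
+   sudo mount /dev/sdb1 /data/recon
+
+   # Recreate the configured directories and restore ownership and permissions.
+   sudo mkdir -p /data/recon/db /data/recon/om-db /data/recon/scm-db
+   sudo chown -R ozone:ozone /data/recon
+   sudo chmod 750 /data/recon/db /data/recon/om-db /data/recon/scm-db
+   ```
+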
+3. **RESTART THE Recon SERVICE:** Start the Recon daemon again.
+
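+   Again assuming the `bin/ozone` launcher:
+
+   ```shell
+   # On the Recon node: start the Recon daemon.
+   ozone --daemon start recon
+   ```
+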
+4. **MONITOR THE AUTOMATIC REBUILD PROCESS:**
+   - Upon starting with empty directories, Recon will automatically begin to 
rebuild its state.
+   - Check the Recon log files (`.log` and `.out`). You will see messages 
indicating that it is connecting to the active OM and SCM.
+
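+   One way to follow the rebuild from a terminal is sketched below; the log file path is a placeholder and varies by deployment.
+
+   ```shell
+   # Follow the Recon log and surface snapshot- and sync-related messages.
+   tail -F /var/log/ozone/ozone-recon.log | grep -iE 'snapshot|sequence|sync'
+   ```
+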
+   **OM Snapshot Download:**
+   - Recon will detect that its OM database is empty (by checking the sequence 
number, which will be 0 or negative).
+   - When the sequence number is less than or equal to 0, Recon automatically 
triggers a full snapshot download from the OM leader.
+   - This is the most time-consuming part of the process. Recon will download 
the snapshot as a tar file, extract it, and begin processing it to populate its 
own namespace database.
+   - You will see log messages such as:
+
+     ```shell
+     Seq number of Recon's OM DB : 0
+     Fetching full snapshot from Ozone Manager
+     Obtaining full snapshot from Ozone Manager
+     ```
+
+   - The snapshot download happens via HTTP from the OM leader, and the 
extracted snapshot is stored in the `ozone.recon.om.db.dir` directory.
+
+   **SCM Sync:**
+   - Recon will also connect to the SCM to sync information about Datanodes, 
pipelines, and containers.
+   - If SCM snapshot is enabled (`ozone.recon.scm.snapshot.enabled=true`, 
which is the default), Recon will initialize or download the SCM database 
snapshot.
+   - If SCM snapshot is disabled, Recon will initialize pipeline information 
directly from SCM via RPC calls.
+   - The SCM sync happens periodically (default interval: 24 hours) and also 
during initial startup.
+
+   **Recon Database Rebuild:**
+   - Once the OM snapshot is downloaded and processed, Recon's task framework 
will automatically process the metadata to rebuild:
+     - ContainerKey and ContainerKeyCount tables (stored in RocksDB at 
`ozone.recon.db.dir`)
+     - GlobalStats, FileCountBySize, and other SQL tables (stored in the SQL 
database)
+     - Namespace summary information
+   - This processing happens asynchronously and may take additional time 
depending on the size of your cluster's metadata.
+
+5. **VERIFY:**
+   - The initial data ingest and processing can take a significant amount of 
time, depending on the size of your cluster's metadata.
+   - During this period, the Recon Web UI may be accessible but show 
incomplete or loading data.
+   - Monitor the Recon logs for completion messages. Look for:
+     - Successful snapshot download and installation
+     - Task completion messages for various Recon tasks (NSSummaryTask, 
ContainerKeyMapperTask, etc.)
+     - Sequence number updates indicating sync progress
+   - Once processing is complete, open the Recon UI and verify that the dashboard correctly displays cluster health and container information and that you can explore the namespace.
+   - You can also check the Recon metrics endpoint, or query Recon's HTTP API, to verify that sync operations have completed successfully (see the sketch after this list).
+
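+A sketch of such a check, assuming the default Recon HTTP port 9888 and a placeholder hostname `recon-host` (endpoint paths may differ between Ozone versions):
+
+```shell
+# High-level cluster state as seen by Recon.
+curl -s http://recon-host:9888/api/v1/clusterState
+
+# Per-task status, including last-processed sequence numbers, to confirm
+# that the rebuild tasks have caught up.
+curl -s http://recon-host:9888/api/v1/task/status
+```
+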
+---
+
+## 4. Additional Considerations
+
+### 4.1 No Data Loss Risk
+
+- This procedure involves no risk of data loss for your actual stored objects.
+- All data on the Recon disk is secondary and can be rebuilt.
+- The Recon service is designed to automatically recover from empty or missing 
databases by fetching fresh snapshots from OM and SCM.
+
+### 4.2 Performance Impact During Rebuild
+
+- The initial OM DB snapshot download and processing can be resource-intensive 
(CPU, network, disk I/O) on both the Recon node and the OM leader.
+- If your cluster is under very heavy load, consider performing this during 
off-peak hours to minimize any potential performance impact on the OM.
+- The default initial delay before the first sync is 1 minute (configurable 
via `ozone.recon.om.snapshot.task.initial.delay`).
+
+### 4.3 SCM Snapshot Configuration
+
+- By default, SCM snapshot is enabled 
(`ozone.recon.scm.snapshot.enabled=true`).
+- If you have disabled SCM snapshots, Recon will still sync container and 
pipeline information from SCM, but it will do so via RPC calls rather than 
downloading a database snapshot.
+- The recovery procedure remains the same in both cases.
+
+### 4.4 Disk Monitoring
+
+- As with all services, proactive disk health monitoring helps you catch hardware failures early and avoid unexpected interruptions to the Recon service.
+- Consider setting up monitoring alerts for disk space usage and disk health metrics, as sketched below.
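+
+A minimal cron-style health probe might look like the following sketch; the device name, mount point, and alerting mechanism are all placeholders.
+
+```shell
+#!/usr/bin/env bash
+# Minimal disk health probe for the Recon node (all names are placeholders).
+DEVICE=/dev/sdb
+MOUNT=/data/recon
+
+# Alert if the SMART overall-health check does not pass.
+if ! sudo smartctl -H "$DEVICE" | grep -q PASSED; then
+  echo "SMART health check failed for $DEVICE"    # replace with your alerting hook
+fi
+
+# Alert when the Recon volume exceeds 85% usage.
+USAGE=$(df --output=pcent "$MOUNT" | tail -n 1 | tr -dc '0-9')
+if [ "$USAGE" -gt 85 ]; then
+  echo "Recon disk usage at ${USAGE}% on $MOUNT"  # replace with your alerting hook
+fi
+```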

