Wei-Chiu Chuang created HDDS-14504:
--------------------------------------
Summary: [Website v2] [Docs] [Administrator Guide] Replacing Recon Disks
Key: HDDS-14504
URL: https://issues.apache.org/jira/browse/HDDS-14504
Project: Apache Ozone
Issue Type: Sub-task
Reporter: Wei-Chiu Chuang
[https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/recon]
Gemini CLI suggests the following content write-up:
---
Title: Replacing Recon Disks
Audience: Cluster Administrators
Prerequisites: Familiarity with Ozone services and Linux system
administration.
---
1. Overview
* Purpose: This guide describes the procedure for replacing a failed disk
on an Ozone Recon node.
* Role of Recon: Recon is an auxiliary service that provides insights,
visualization, and management for an Ozone cluster. It maintains a copy of
metadata from the Ozone Manager (OM) and Storage Container Manager (SCM)
to build its own database for analysis.
* Impact of Recon Disk Failure: A failure of the Recon disk will cause the
Recon service to stop functioning. However, because Recon is not in the
critical path for data I/O, this failure has no impact on the core
operations of your Ozone cluster. Client reads and writes will continue
normally.
All data on the Recon disk is fully rebuildable from OM and SCM.
---
2. Pre-flight Checks
1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to
confirm which disk has failed and its mount point.
2. Identify Recon Directories: Check your ozone-site.xml to confirm which
Recon directories are on the failed disk. The primary properties are:
* ozone.recon.db.dir: Stores Recon's primary RocksDB database, which
contains aggregated data and analysis results.
* ozone.recon.om.db.dir: Stores the copy of the OM database snapshot
that Recon uses as its source of truth for the namespace.
3. Prepare the Replacement Disk: Have a new, healthy disk ready. Once it is
installed, it must be formatted and mounted at the same mount point as the
failed disk, with the correct ownership and permissions for the user that
runs the Recon process.
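
The directory check in step 2 above could be scripted. The following is a
minimal Python sketch, not part of Ozone. It assumes ozone-site.xml uses the
standard Hadoop-style <property> layout with <name> and <value> children; the
function name, the sample paths, and the mount point are hypothetical. The
comparison is purely lexical (symlinks and nested mounts are not resolved),
and a property that is not set in the file is simply not reported.

```python
import posixpath
import xml.etree.ElementTree as ET

# The two Recon directory properties named in this guide.
RECON_DIR_PROPERTIES = ("ozone.recon.db.dir", "ozone.recon.om.db.dir")

# Hypothetical sample configuration, for illustration only.
SAMPLE_SITE_XML = """<configuration>
  <property>
    <name>ozone.recon.db.dir</name>
    <value>/data/disk1/recon/db</value>
  </property>
  <property>
    <name>ozone.recon.om.db.dir</name>
    <value>/data/disk2/recon/om</value>
  </property>
</configuration>"""


def recon_dirs_on_mount(site_xml_text, mount_point):
    """Return {property: path} for Recon directories under mount_point."""
    root = ET.fromstring(site_xml_text)
    mount = posixpath.normpath(mount_point)
    prefix = mount.rstrip("/") + "/"
    affected = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name not in RECON_DIR_PROPERTIES or not value:
            continue
        path = posixpath.normpath(value)
        # Lexical check: the path is the mount point or lies below it.
        if path == mount or path.startswith(prefix):
            affected[name] = path
    return affected
```

For example, recon_dirs_on_mount(SAMPLE_SITE_XML, "/data/disk1") reports only
ozone.recon.db.dir, since the other directory lives on a different mount.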
---
3. Procedure for Replacing a Recon Disk
This is a low-risk recovery procedure that can be performed without any
downtime for your main Ozone cluster.
1. Stop the Recon Service: On the Recon node, stop the Recon daemon. The
rest of your Ozone cluster remains fully operational.
2. Replace and Configure Disk: Physically replace the hardware. Mount the
new, empty disk at the original mount point so that the path(s) defined in
ozone.recon.db.dir and ozone.recon.om.db.dir reside on it. Create those
directories if they do not exist, owned by the user that runs Recon.
3. Restart the Recon Service: Start the Recon daemon again.
4. Monitor the Automatic Rebuild Process:
* Upon starting with empty directories, Recon will automatically begin
to rebuild its state.
* Check the Recon log files (.log and .out). You will see messages
indicating that it is connecting to the active OM and SCM.
* OM Snapshot Download: Recon will request a full DB snapshot from the
OM leader. This is the most time-consuming part of the process. Recon will
download this snapshot and begin processing it to populate its own
namespace database.
* SCM Sync: Recon will also connect to the SCM to sync information about
DataNodes, pipelines, and containers.
5. Verify:
* The initial data ingest and processing can take a significant amount
of time, depending on the size of your cluster's metadata.
* During this period, the Recon Web UI may be accessible but show
incomplete or loading data.
* Once the processing is complete, navigate the Recon UI to verify that
the dashboard correctly displays cluster health and container information,
and that you can explore the namespace.
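
Before restarting Recon in step 3, it may help to confirm that each Recon
directory on the new disk exists, is empty, and is writable. The following is
a minimal Python sketch under stated assumptions: it is not an Ozone tool, the
function name is hypothetical, and it should be run as the user that runs
Recon so that the permission check is meaningful. A lost+found entry, which
some filesystems create at the root of a new volume, is ignored by default.

```python
import os


def check_recon_dir(path, ignore=("lost+found",)):
    """Return a list of problems found for one Recon directory (empty if OK)."""
    if not os.path.isdir(path):
        return [f"{path}: not a directory"]
    problems = []
    # Anything left over (other than ignored entries) means it is not empty.
    leftovers = sorted(e for e in os.listdir(path) if e not in ignore)
    if leftovers:
        problems.append(f"{path}: not empty ({', '.join(leftovers)})")
    # Write and search permission are both needed to create files inside.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: not writable by the current user")
    return problems
```

An empty result for every configured directory suggests the disk is ready.
Any reported problem is worth resolving before the daemon is started.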
---
4. Additional Considerations
* No Data Loss Risk: This procedure involves no risk of data loss for your
actual stored objects. All data on the Recon disk is secondary and
rebuildable.
* Performance Impact During Rebuild: The initial OM DB snapshot download and
processing can be resource-intensive (CPU, network) on both the Recon node
and the OM leader. If your cluster is under very heavy load, consider
performing this during off-peak hours to minimize any potential performance
impact on the OM.
* Disk Monitoring: As with all services, monitor disk health so that
hardware failures can be handled proactively and unexpected interruptions to
the Recon service are avoided.
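
As one small piece of such monitoring, free space on the Recon volumes can be
checked periodically. The following is a minimal Python sketch, assuming a 10%
free-space threshold chosen purely for illustration; the function name is
hypothetical. It complements rather than replaces hardware health checks such
as SMART, since it only looks at capacity.

```python
import shutil


def low_space_paths(paths, min_free_fraction=0.10, usage=shutil.disk_usage):
    """Return [(path, free_fraction)] for paths below the free-space threshold.

    The usage parameter defaults to shutil.disk_usage and can be replaced
    with a stub for testing.
    """
    flagged = []
    for path in paths:
        stats = usage(path)  # exposes .total, .used and .free in bytes
        if stats.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        free_fraction = stats.free / stats.total
        if free_fraction < min_free_fraction:
            flagged.append((path, free_fraction))
    return flagged
```

It would be called with the directories configured in ozone.recon.db.dir and
ozone.recon.om.db.dir, for example from a cron job that raises an alert when
the returned list is not empty.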