[
https://issues.apache.org/jira/browse/HDDS-14504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-14504:
----------------------------------
Labels: pull-request-available (was: )
> [Website v2] [Docs] [Administrator Guide] Replacing Recon Disks
> ---------------------------------------------------------------
>
> Key: HDDS-14504
> URL: https://issues.apache.org/jira/browse/HDDS-14504
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Wei-Chiu Chuang
> Assignee: Gargi Jaiswal
> Priority: Major
> Labels: pull-request-available
>
> [https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/recon]
>
> Gemini cli suggests the following content writeup:
>
> ---
> Title: Replacing Recon Disks
> Audience: Cluster Administrators
> Prerequisites: Familiarity with Ozone services and Linux system
> administration.
> ---
> 1. Overview
> * Purpose: This guide provides a straightforward procedure for replacing
> a failed disk on an Ozone Recon node.
> * Role of Recon: Recon is an auxiliary service that provides insights,
> visualization, and management for an Ozone cluster. It maintains a copy of
> metadata from the Ozone Manager (OM) and Storage Container Manager (SCM)
> to build its own database for analysis.
> * Impact of Recon Disk Failure: A failure of the Recon disk will cause the
> Recon service to stop functioning. However, because Recon is not in the
> critical path for data I/O, this failure has no impact on the core
> operations of your Ozone cluster. Client reads and writes will continue
> normally.
> All data on the Recon disk is fully rebuildable from OM and SCM.
> ---
> 2. Pre-flight Checks
> 1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to
> confirm which disk has failed and its mount point.
> 2. Identify Recon Directories: Check your ozone-site.xml to confirm which
> Recon directories are on the failed disk. The primary properties are:
> * ozone.recon.db.dir: Stores Recon's primary RocksDB database, which
> contains aggregated data and analysis results.
> * ozone.recon.om.db.dir: Stores the copy of the OM database snapshot
> that Recon uses as its source of truth for the namespace.
> 3. Prepare the Replacement Disk: Physically install a new, healthy disk.
> Format it and mount it at the same path as the failed disk. Ensure it has the
> correct ownership and permissions for the user that runs the Recon
> process.
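
Pre-flight step 2 above could be sketched in Python roughly as follows. This is a minimal sketch, not part of Ozone: the helper names read_recon_dirs and dirs_under_mount are hypothetical, the configured values are assumed to be absolute paths, and the comparison is a plain path-prefix match that does not resolve symlinks or nested mounts. Properties that are not set in the file are simply not reported.

```python
import os
import xml.etree.ElementTree as ET

# The two Recon directory properties named in the pre-flight checks.
RECON_DIR_KEYS = ("ozone.recon.db.dir", "ozone.recon.om.db.dir")


def read_recon_dirs(site_xml_path):
    """Return {property name: value} for the Recon directory properties set in the file."""
    root = ET.parse(site_xml_path).getroot()
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in RECON_DIR_KEYS and value:
            found[name] = value
    return found


def dirs_under_mount(recon_dirs, mount_point):
    """Return the entries whose directory is at or below mount_point.

    Assumption: plain string comparison on normalized absolute paths.
    """
    mount = os.path.normpath(mount_point)
    prefix = mount.rstrip(os.sep) + os.sep
    affected = {}
    for name, path in recon_dirs.items():
        norm = os.path.normpath(path)
        if norm == mount or norm.startswith(prefix):
            affected[name] = path
    return affected
```

Calling dirs_under_mount(read_recon_dirs(path_to_ozone_site_xml), failed_mount_point) returns the subset of Recon directories that were on the failed disk. Both arguments are placeholders for values from your own environment.
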
> ---
> 3. Procedure for Replacing a Recon Disk
> This is a low-risk recovery procedure that can be performed without any
> downtime for your main Ozone cluster.
> 1. STOP THE RECON SERVICE: On the Recon node, stop the Recon daemon. The
> rest of your Ozone cluster remains fully operational.
> 2. Replace and Configure Disk: Physically replace the hardware. Mount the
> new, empty disk at the same mount point as the failed disk, so that the
> path(s) defined in ozone.recon.db.dir and ozone.recon.om.db.dir resolve
> to it.
> 3. RESTART THE RECON SERVICE: Start the Recon daemon again.
> 4. MONITOR THE AUTOMATIC REBUILD PROCESS:
> * Upon starting with empty directories, Recon will automatically begin
> to rebuild its state.
> * Check the Recon log files (.log and .out). You will see messages
> indicating that it is connecting to the active OM and SCM.
> * OM Snapshot Download: Recon will request a full DB snapshot from the
> OM leader. This is the most time-consuming part of the process. Recon will
> download this snapshot and begin processing it to populate its own
> namespace database.
> * SCM Sync: Recon will also connect to the SCM to sync information
> about DataNodes, pipelines, and containers.
> 5. VERIFY:
> * The initial data ingest and processing can take a significant amount
> of time, depending on the size of your cluster's metadata.
> * During this period, the Recon Web UI may be accessible but show
> incomplete or loading data.
> * Once the processing is complete, navigate the Recon UI to verify
> that the dashboard correctly displays cluster health and container
> information, and that you can explore the namespace.
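
Before restarting Recon in step 3 above, it can help to confirm that each Recon directory on the new disk exists, is empty, and is writable. The following is a minimal sketch under stated assumptions, not an Ozone tool: check_recon_dir is a hypothetical helper, it must be run as the user that runs the Recon process (os.access checks the calling user, and always succeeds for root), and it ignores a lost+found entry on the assumption that the directory may be the root of a freshly formatted ext filesystem.

```python
import os

# Entries that a freshly formatted filesystem may contain on its own
# (assumption: ext2/3/4 creates lost+found at the filesystem root).
IGNORED_ENTRIES = {"lost+found"}


def check_recon_dir(path):
    """Return a list of problems for one Recon directory; an empty list means ready."""
    if not os.path.isdir(path):
        return [f"{path}: not a directory (is the new disk mounted?)"]

    problems = []
    leftovers = sorted(set(os.listdir(path)) - IGNORED_ENTRIES)
    if leftovers:
        problems.append(f"{path}: not empty, contains {leftovers}")
    # Write and search permission for the user running this check.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: not writable by the current user")
    return problems
```

Run check_recon_dir once for each directory configured in ozone.recon.db.dir and ozone.recon.om.db.dir, and start Recon only when every call returns an empty list.
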
> ---
> 4. Additional Considerations
> * No Data Loss Risk: This procedure involves no risk of data loss for your
> actual stored objects. All data on the Recon disk is secondary and
> rebuildable.
> * Performance Impact During Rebuild: The initial OM DB snapshot download
> and processing can be resource-intensive (CPU, network) on both the Recon node
> and the OM leader. If your cluster is under very heavy load, consider
> performing this during off-peak hours to minimize any potential performance
> impact on the OM.
> * Disk Monitoring: As with all services, actively monitoring disk health
> is a good practice to proactively manage hardware failures and avoid
> unexpected interruptions to the Recon service.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)