[ 
https://issues.apache.org/jira/browse/HDDS-14504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-14504:
----------------------------------
    Labels: pull-request-available  (was: )

> [Website v2] [Docs] [Administrator Guide] Replacing Recon Disks
> ---------------------------------------------------------------
>
>                 Key: HDDS-14504
>                 URL: https://issues.apache.org/jira/browse/HDDS-14504
>             Project: Apache Ozone
>          Issue Type: Sub-task
>            Reporter: Wei-Chiu Chuang
>            Assignee: Gargi Jaiswal
>            Priority: Major
>              Labels: pull-request-available
>
> [https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/recon]
>  
> Gemini CLI suggests the following content write-up:
>  
>   ---
>   Title: Replacing Recon Disks
>   Audience: Cluster Administrators
>   Prerequisites: Familiarity with Ozone services and Linux system 
> administration.
>   ---
>   1. Overview
>    * Purpose: This guide provides a straightforward procedure for replacing 
> a failed disk on an Ozone Recon node.
>    * Role of Recon: Recon is an auxiliary service that provides insights, 
> visualization, and management for an Ozone cluster. It maintains a copy of
>      metadata from the Ozone Manager (OM) and Storage Container Manager (SCM) 
> to build its own database for analysis.
>    * Impact of Recon Disk Failure: A failure of the Recon disk will cause the 
> Recon service to stop functioning. However, because Recon is not in the
>      critical path for data I/O, this failure has no impact on the core 
> operations of your Ozone cluster. Client reads and writes will continue 
> normally.
>      All data on the Recon disk is fully rebuildable from OM and SCM.
>   ---
>   2. Pre-flight Checks
>    1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to 
> confirm which disk has failed and its mount point.
>    2. Identify Recon Directories: Check your ozone-site.xml to confirm which 
> Recon directories are on the failed disk. The primary properties are:
>        * ozone.recon.db.dir: Stores Recon's primary RocksDB database, which 
> contains aggregated data and analysis results.
>        * ozone.recon.om.db.dir: Stores the copy of the OM database snapshot 
> that Recon uses as its source of truth for the namespace.
>    3. Prepare the Replacement Disk: Physically install a new, healthy disk. 
> Format it and mount it at the same path as the failed disk. Ensure it has the
>       correct ownership and permissions for the user that runs the Recon 
> process.
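
The lookup in step 2 above can be scripted. The following Python sketch is illustrative only: it assumes ozone-site.xml uses the Hadoop-style configuration/property/name/value layout, and the helper names (recon_dirs, dirs_on_mount), the sample paths, and the /data/disk1 mount point are all hypothetical. It compares path strings only and does not resolve symlinks or nested mounts.

```python
import posixpath
import xml.etree.ElementTree as ET

# The two Recon directory properties named in the pre-flight checks.
RECON_DIR_KEYS = ("ozone.recon.db.dir", "ozone.recon.om.db.dir")


def recon_dirs(site_xml_text):
    """Return {property name: directory} for the Recon properties that are set."""
    root = ET.fromstring(site_xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in RECON_DIR_KEYS and value:
            found[name] = value
    return found


def dirs_on_mount(dirs, mount_point):
    """Return the property names whose directory lies at or below mount_point."""
    mount = posixpath.normpath(mount_point)
    affected = []
    for name, path in dirs.items():
        path = posixpath.normpath(path)
        # Both paths are expected to be absolute; commonpath compares components,
        # so /data/disk10 is not mistaken for a child of /data/disk1.
        if posixpath.commonpath([path, mount]) == mount:
            affected.append(name)
    return sorted(affected)


# Hypothetical configuration content, used here instead of reading a real file.
SAMPLE = """<configuration>
  <property><name>ozone.recon.db.dir</name><value>/data/disk1/recon</value></property>
  <property><name>ozone.recon.om.db.dir</name><value>/data/disk2/recon-om</value></property>
</configuration>"""

print(dirs_on_mount(recon_dirs(SAMPLE), "/data/disk1"))  # ['ozone.recon.db.dir']
```

On a real node, the text of the actual ozone-site.xml would be passed to recon_dirs in place of SAMPLE.
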
>   ---
>   3. Procedure for Replacing a Recon Disk
>   This is a low-risk recovery procedure that can be performed without any 
> downtime for your main Ozone cluster.
>    1. STOP THE RECON SERVICE: On the Recon node, stop the Recon daemon. The 
> rest of your Ozone cluster remains fully operational.
>    2. Replace and Configure Disk: Physically replace the hardware. Mount the 
> new, empty disk at the failed disk's original mount point so that the path(s)
>       defined in ozone.recon.db.dir and ozone.recon.om.db.dir reside on it.
>    3. RESTART THE RECON SERVICE: Start the Recon daemon again.
>    4. MONITOR THE AUTOMATIC REBUILD PROCESS:
>        * Upon starting with empty directories, Recon will automatically begin 
> to rebuild its state.
>        * Check the Recon log files (.log and .out). You will see messages 
> indicating that it is connecting to the active OM and SCM.
>        * OM Snapshot Download: Recon will request a full DB snapshot from the 
> OM leader. This is the most time-consuming part of the process. Recon will
>          download this snapshot and begin processing it to populate its own 
> namespace database.
>        * SCM Sync: Recon will also connect to the SCM to sync information 
> about DataNodes, pipelines, and containers.
>    5. VERIFY:
>        * The initial data ingest and processing can take a significant amount 
> of time, depending on the size of your cluster's metadata.
>        * During this period, the Recon Web UI may be accessible but show 
> incomplete or loading data.
>        * Once the processing is complete, navigate the Recon UI to verify 
> that the dashboard correctly displays cluster health and container
>          information, and that you can explore the namespace.
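
One way to watch for Recon coming back after the restart is to poll its HTTP endpoint. The sketch below is a hedged illustration, not part of the proposed guide: the helper names are hypothetical, and the example URL assumes a default Recon setup (HTTP port 9888 and the /api/v1/clusterState path), which should be checked against your deployment. A successful response only shows that the Recon HTTP server is answering; as noted above, the data it serves may still be incomplete while the rebuild runs.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET on url answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False


def wait_until_up(url, attempts=30, delay_s=10.0, probe=http_ok, sleep=time.sleep):
    """Probe url up to `attempts` times.

    Returns the 1-based number of the attempt that succeeded, or None if
    every attempt failed. `probe` and `sleep` are injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None


# Example call (hostname is hypothetical; port and path are assumptions):
# wait_until_up("http://recon.example.com:9888/api/v1/clusterState")
```

Keeping the probe injectable means the retry logic can be exercised without a running Recon instance.
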
>   ---
>   4. Additional Considerations
>    * No Data Loss Risk: This procedure involves no risk of data loss for your 
> actual stored objects. All data on the Recon disk is secondary and
>      rebuildable.
>    * Performance Impact During Rebuild: The initial OM DB snapshot download 
> and processing can be resource-intensive (CPU, network) on both the Recon node
>      and the OM leader. If your cluster is under very heavy load, consider 
> performing this during off-peak hours to minimize any potential performance
>      impact on the OM.
>    * Disk Monitoring: As with all services, actively monitoring disk health 
> is a good practice to proactively manage hardware failures and avoid
>      unexpected interruptions to the Recon service.
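
To decide whether the rebuild fits an off-peak window, a rough lower bound on the OM snapshot download time is the snapshot size divided by the sustained network throughput. The Python sketch below works one such estimate; the function name and the figures (a 200 GiB snapshot, 100 MiB/s throughput) are hypothetical, and the result covers the transfer only, not the processing Recon performs afterwards.

```python
def download_seconds(snapshot_bytes, throughput_bytes_per_s):
    """Lower bound on transfer time: size divided by sustained throughput."""
    if snapshot_bytes < 0:
        raise ValueError("snapshot_bytes must be non-negative")
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput_bytes_per_s must be positive")
    return snapshot_bytes / throughput_bytes_per_s


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical figures: a 200 GiB OM snapshot over a link sustaining 100 MiB/s.
seconds = download_seconds(200 * GIB, 100 * MIB)
print(f"{seconds:.0f} s (~{seconds / 60:.0f} min)")  # 2048 s (~34 min)
```
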



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

