Wei-Chiu Chuang created HDDS-14504:
--------------------------------------

             Summary: [Website v2] [Docs] [Administrator Guide] Replacing Recon Disks
                 Key: HDDS-14504
                 URL: https://issues.apache.org/jira/browse/HDDS-14504
             Project: Apache Ozone
          Issue Type: Sub-task
            Reporter: Wei-Chiu Chuang


[https://ozone-site-v2.staged.apache.org/docs/administrator-guide/operations/disk-replacement/recon]

 

Gemini CLI suggests the following content write-up:

 

  ---

  Title: Replacing Recon Disks

  Audience: Cluster Administrators
  Prerequisites: Familiarity with Ozone services and Linux system 
administration.

  ---

  1. Overview

   * Purpose: This guide describes how to replace a failed disk on an Ozone 
Recon node.
   * Role of Recon: Recon is an auxiliary service that provides insights, 
visualization, and management for an Ozone cluster. It maintains a copy of
     metadata from the Ozone Manager (OM) and Storage Container Manager (SCM) 
to build its own database for analysis.
   * Impact of Recon Disk Failure: If the disk that holds Recon's data fails, 
the Recon service stops functioning. Because Recon is not in the critical path 
for data I/O, the failure does not affect the core operations of your Ozone 
cluster: client reads and writes continue normally. All data on the Recon disk 
can be rebuilt from OM and SCM.

  ---

  2. Pre-flight Checks

   1. Identify the Failed Disk: Use system tools (dmesg, smartctl, etc.) to 
confirm which disk has failed and its mount point.
   2. Identify Recon Directories: Check your ozone-site.xml to confirm which 
Recon directories are on the failed disk. The primary properties are:
       * ozone.recon.db.dir: Stores Recon's primary RocksDB database, which 
contains aggregated data and analysis results.
       * ozone.recon.om.db.dir: Stores the copy of the OM database snapshot 
that Recon uses as its source of truth for the namespace.
   3. Prepare the Replacement Disk: Physically install a new, healthy disk. 
Format it and mount it at the same path as the failed disk. Ensure the mount 
point has the correct ownership and permissions for the user that runs the 
Recon process.
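
  The directory check in step 2 can be scripted. The following is a minimal 
sketch, not an official Ozone tool: it reads the text of an ozone-site.xml 
file, extracts the two Recon directory properties named above, and reports 
which of them sit on or below a given mount point. The function names and the 
example paths are hypothetical, and the comparison assumes absolute paths.

```python
import os
import xml.etree.ElementTree as ET

# The two properties named in the pre-flight checks above.
RECON_DIR_KEYS = ("ozone.recon.db.dir", "ozone.recon.om.db.dir")


def recon_dirs(site_xml_text):
    """Return {property: path} for the Recon directory properties that are set."""
    root = ET.fromstring(site_xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name in RECON_DIR_KEYS and value:
            found[name] = value
    return found


def dirs_under_mount(dirs, mount_point):
    """Return the entries whose path lies on or below mount_point.

    This is a pure path comparison (no filesystem access), so it also
    works when the failed disk can no longer be read.
    """
    mount = os.path.normpath(mount_point)
    hits = {}
    for key, path in dirs.items():
        candidate = os.path.normpath(path)
        if os.path.commonpath([mount, candidate]) == mount:
            hits[key] = path
    return hits
```

  For example, pass the contents of ozone-site.xml to recon_dirs() and the 
mount point of the failed disk to dirs_under_mount(); any property returned 
points at a directory that Recon will need to rebuild.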

  ---

  3. Procedure for Replacing a Recon Disk

  This is a low-risk recovery procedure that can be performed without any 
downtime for your main Ozone cluster.

   1. STOP THE RECON SERVICE: On the Recon node, stop the Recon daemon. The 
rest of your Ozone cluster remains fully operational.

   2. Replace and Configure Disk: Physically replace the hardware. Mount the 
new, empty disk at the path(s) defined in ozone.recon.db.dir and
      ozone.recon.om.db.dir.

   3. RESTART THE RECON SERVICE: Start the Recon daemon again.

   4. MONITOR THE AUTOMATIC REBUILD PROCESS:
       * Upon starting with empty directories, Recon will automatically begin 
to rebuild its state.
       * Check the Recon log files (.log and .out) for messages indicating 
that Recon is connecting to the active OM and SCM.
       * OM Snapshot Download: Recon will request a full DB snapshot from the 
OM leader. This is the most time-consuming part of the process. Recon will
         download this snapshot and begin processing it to populate its own 
namespace database.
       * SCM Sync: Recon will also connect to the SCM to sync information about 
DataNodes, pipelines, and containers.

   5. VERIFY:
       * The initial data ingest and processing can take a significant amount 
of time, depending on the size of your cluster's metadata.
       * During this period, the Recon Web UI may be accessible but show 
incomplete or loading data.
       * Once processing is complete, open the Recon UI and verify that the 
dashboard correctly displays cluster health and container information, and 
that you can explore the namespace.
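
  Besides the Web UI, the rebuild can be watched from a script. The sketch 
below is illustrative only and rests on several assumptions that should be 
checked against your Ozone version: that Recon serves HTTP on port 9888, that 
it exposes a task status endpoint at /api/v1/task/status, and that each entry 
carries taskName and lastUpdatedTimestamp fields. The host name is 
hypothetical, and treating a missing or zero timestamp as "not yet run" is an 
interpretation, not documented behavior.

```python
import json
import urllib.request

# Hypothetical host; port 9888 is assumed to be the Recon HTTP port.
RECON_URL = "http://recon.example.com:9888"


def fetch_task_status(base_url=RECON_URL, timeout=10):
    """Fetch the task status list from Recon (endpoint path is an assumption)."""
    url = f"{base_url}/api/v1/task/status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def tasks_not_yet_run(tasks):
    """Return the names of tasks with a missing or zero lastUpdatedTimestamp.

    Assumption: a task that has never completed reports no timestamp or 0.
    """
    return [
        task.get("taskName", "<unnamed>")
        for task in tasks
        if not task.get("lastUpdatedTimestamp")
    ]
```

  Calling tasks_not_yet_run(fetch_task_status()) at intervals gives a rough 
view of progress; an empty list suggests every task has run at least once, 
but the UI check in step 5 remains the final verification.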

  ---

  4. Additional Considerations

   * No Data Loss Risk: This procedure poses no risk to the objects stored in 
your cluster. All data on the Recon disk is secondary and can be rebuilt.
   * Performance Impact During Rebuild: The initial OM DB snapshot download and 
processing can be resource-intensive (CPU, network) on both the Recon node
     and the OM leader. If your cluster is under very heavy load, consider 
performing this during off-peak hours to minimize any potential performance
     impact on the OM.
   * Disk Monitoring: As with all services, actively monitoring disk health is 
a good practice to proactively manage hardware failures and avoid unexpected
     interruptions to the Recon service.
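
  As one small piece of such monitoring, a periodic job could confirm that 
each Recon directory still exists, is writable, and is not close to full. 
This is a sketch under stated assumptions: the function name and the 85% 
threshold are arbitrary examples, and the check does not detect hardware 
faults, which still require tools such as smartctl.

```python
import os
import shutil


def check_recon_dir(path, max_used_percent=85.0):
    """Return a small health report for one directory.

    max_used_percent is an arbitrary example threshold, not an Ozone default.
    """
    report = {
        "path": path,
        "exists": os.path.isdir(path),
        "writable": False,
        "used_percent": None,
        "ok": False,
    }
    if not report["exists"]:
        return report

    # The process needs write and search permission to create files here.
    report["writable"] = os.access(path, os.W_OK | os.X_OK)

    usage = shutil.disk_usage(path)
    if usage.total > 0:
        report["used_percent"] = round(100.0 * usage.used / usage.total, 1)

    report["ok"] = (
        report["writable"]
        and report["used_percent"] is not None
        and report["used_percent"] < max_used_percent
    )
    return report
```

  Run it as the user that owns the Recon process, once for each path 
configured in ozone.recon.db.dir and ozone.recon.om.db.dir, and alert when 
"ok" is False.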



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
