[ https://issues.apache.org/jira/browse/HDDS-15009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDDS-15009:
-----------------------------------
    Status: Patch Available  (was: Open)

> [Docs] Administrator guide: Ozone Container Scanner
> ---------------------------------------------------
>
>                 Key: HDDS-15009
>                 URL: https://issues.apache.org/jira/browse/HDDS-15009
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: documentation
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>              Labels: pull-request-available
>
>   Ozone Container Scanner: Administrator's Guide
>   The Container Scanner is a critical background service running on Ozone
>   Datanodes that ensures data integrity by proactively detecting "bit rot",
>   or silent data corruption. This guide explains how it works, how it
>   handles errors, and how to tune it for your cluster.
>   1. Overview
>   Storage media can occasionally suffer from silent corruption, where data
>   becomes unreadable or is modified without raising an immediate I/O error.
>   The Container Scanner periodically verifies all containers stored on a
>   Datanode to identify such issues before the data is requested by a client.
>   Why it is useful:
>    * Proactive Detection: It identifies corrupt replicas early, allowing 
> Ozone to recover data from healthy replicas while they are still available.
>    * Data Reliability: By ensuring that all replicas are healthy, it 
> maintains the desired replication factor and prevents data loss.
>    * Automated Recovery: It integrates with the Storage Container Manager
>      (SCM) to trigger automatic re-replication when a corrupt replica is
>      detected.
>   ---
>   2. Scanner Types
>   The system employs three specialized scanners to balance thoroughness with 
> system performance:
>    * Background Metadata Scanner: A lightweight scanner that verifies the 
> integrity of container metadata and the internal database. It runs in a single
>      thread across all volumes on a Datanode.
>    * Background Data Scanner: A more intensive scanner that reads all data
>      within a container and verifies it against stored checksums. It runs
>      one thread per volume and is heavily throttled.
>    * On-Demand Scanner: Triggered automatically when a container is first 
> opened or if corruption is suspected during normal operations.
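>   The "heavily throttled" behavior of the Background Data Scanner can be
>   pictured as a rate-limited read loop. The sketch below is a hedged
>   illustration in Python, not Ozone's actual implementation: after each
>   chunk read, it sleeps just long enough to keep the average rate at or
>   below the configured limit (analogous to volume.bytes.per.second).

```python
import io
import time

def throttled_scan(reader, bytes_per_second, chunk_size=1 << 20):
    """Read a stream while keeping the average rate <= bytes_per_second.

    Illustrative sketch of bandwidth throttling only; the point is that
    background scanning is paced so it does not saturate the disk.
    Returns the total number of bytes read.
    """
    start = time.monotonic()
    consumed = 0
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            return consumed  # end of stream: total bytes scanned
        consumed += len(chunk)
        # Seconds this many bytes "should" take at the configured rate.
        expected = consumed / bytes_per_second
        elapsed = time.monotonic() - start
        if expected > elapsed:
            time.sleep(expected - elapsed)
```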
>   ---
>   3. Error Handling and Volume Health
>   When a scanner detects corruption in a container, the following sequence 
> occurs:
>    1. Marked UNHEALTHY: The Datanode marks the container as UNHEALTHY.
>    2. SCM Notification: The next heartbeat reports this state to SCM, 
> triggering automatic re-replication from healthy copies.
>    3. Volume Scan: Detecting corruption may indicate a failing disk. The 
> Datanode automatically triggers a Volume Scan on the underlying disk.
>   The Volume Scanner
>   The Volume Scanner (part of StorageVolumeChecker) checks the physical 
> health of the disk:
>    * Periodic Check: Runs every 60 minutes by default.
>    * Mechanism: Performs small I/O tests (reads/writes) to verify disk 
> responsiveness.
>    * Failure: If the disk fails these tests, the entire volume is marked as 
> FAILED, and all its containers are reported as lost to SCM for recovery.
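>   The "small I/O tests" amount to a write/fsync/read-back probe against the
>   volume. The following is an illustrative Python sketch, not the actual
>   StorageVolumeChecker code; the defaults loosely mirror the configuration
>   table in section 4 (3 tests, 10-minute timeout).

```python
import os
import tempfile
import time

def probe_volume(path, io_test_count=3, timeout_seconds=600):
    """Write a small file, fsync it, read it back, and compare contents.

    Hedged sketch of a disk responsiveness check; returns False if any
    test hits an I/O error, a read-back mismatch, or the deadline.
    """
    payload = os.urandom(4096)  # small I/O, as the guide describes
    deadline = time.monotonic() + timeout_seconds
    for _ in range(io_test_count):
        if time.monotonic() > deadline:
            return False  # disk too slow: treat the volume as failed
        probe = None
        try:
            fd, probe = tempfile.mkstemp(dir=path)
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force the write to hit the disk
            with open(probe, "rb") as f:
                if f.read() != payload:
                    return False  # read-back mismatch: silent corruption
        except OSError:
            return False  # hard I/O error: volume unhealthy
        finally:
            if probe is not None:
                try:
                    os.unlink(probe)
                except OSError:
                    pass
    return True
```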
>   When does a Datanode shut down?
>   A Datanode will shut itself down if the number of failed volumes exceeds 
> the configured tolerance threshold. This prevents a "zombie" Datanode with no
>   functional storage from remaining part of the cluster.
>   By default, Ozone requires at least one functional volume of each type 
> (Data, Metadata, and DB). If all volumes of a specific type fail, the Datanode
>   triggers a fatal shutdown.
>   ---
>   4. Configuration and Tuning
>   Container Scanner Configurations
>   Prefix: hdds.container.scrub.
>   
>   ┌─────────────────────────┬─────────┬─────────────────────────────────────────────────────┐
>   │ Configuration Key       │ Default │ Description                                         │
>   ├─────────────────────────┼─────────┼─────────────────────────────────────────────────────┤
>   │ enabled                 │ true    │ Enable/disable all container scanners.              │
>   │ metadata.scan.interval  │ 3h      │ Interval between metadata scans.                    │
>   │ data.scan.interval      │ 7d      │ Minimum interval between full data scan iterations. │
>   │ volume.bytes.per.second │ 5MB/s   │ Bandwidth limit per volume for background scanning. │
>   │ min.gap                 │ 15m     │ Minimum time before re-scanning the same container. │
>   └─────────────────────────┴─────────┴─────────────────────────────────────────────────────┘
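>   In ozone-site.xml, each key combines with the prefix above. The fragment
>   below shows example values only; verify the exact key names and accepted
>   units against the ozone-default.xml shipped with your release.

```xml
<!-- Example only: hdds.container.scrub.* keys from the table above. -->
<property>
  <name>hdds.container.scrub.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hdds.container.scrub.data.scan.interval</name>
  <value>7d</value>
</property>
<property>
  <name>hdds.container.scrub.volume.bytes.per.second</name>
  <!-- Illustrative raised limit (10 MB/s expressed in bytes); the
       accepted value format may differ between releases. -->
  <value>10485760</value>
</property>
```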
>   Datanode Volume Failure Configurations
>   Prefix: hdds.datanode.
>   
>   ┌──────────────────────────────────────┬─────────┬────────────────────────────────────────────────────────────────┐
>   │ Configuration Key                    │ Default │ Description                                                    │
>   ├──────────────────────────────────────┼─────────┼────────────────────────────────────────────────────────────────┤
>   │ failed.data.volumes.tolerated        │ -1      │ Data volumes allowed to fail before shutdown. -1 tolerates any │
>   │                                      │         │ number of failures as long as one healthy data volume remains. │
>   │ failed.metadata.volumes.tolerated    │ -1      │ Metadata volumes allowed to fail before shutdown.              │
>   │ failed.db.volumes.tolerated          │ -1      │ DB volumes allowed to fail before shutdown.                    │
>   │ periodic.disk.check.interval.minutes │ 60      │ How often to run the background Volume Scanner.                │
>   │ disk.check.io.test.count             │ 3       │ Number of I/O tests used to decide that a disk has failed.     │
>   │ disk.check.timeout                   │ 10m     │ Maximum time allowed for a single disk check.                  │
>   └──────────────────────────────────────┴─────────┴────────────────────────────────────────────────────────────────┘
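>   As with the scanner settings, these keys combine with the hdds.datanode.
>   prefix in ozone-site.xml. Example values only; check your release's
>   ozone-default.xml for the authoritative names and defaults.

```xml
<!-- Example only: hdds.datanode.* volume-failure keys from the table above. -->
<property>
  <name>hdds.datanode.failed.data.volumes.tolerated</name>
  <!-- Fail fast: shut the Datanode down once more than one data volume fails. -->
  <value>1</value>
</property>
<property>
  <name>hdds.datanode.periodic.disk.check.interval.minutes</name>
  <value>60</value>
</property>
```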
>   Tuning Tips:
>    * High-Density Nodes: If disks are very large, increase 
> volume.bytes.per.second to ensure the 7-day data scan interval can be met.
>    * Performance: If background scanning causes I/O wait on your 
> applications, lower the bandwidth limit.
>    * High Availability: Adjust the "tolerated" volume counts if you prefer 
> Datanodes to fail fast and be replaced rather than running in a degraded 
> state.
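>   As a back-of-envelope check for the first tip: the bandwidth needed to
>   read a whole volume within the scan interval is simply volume size divided
>   by interval. The helper below is illustrative only (its name and units are
>   not part of Ozone).

```python
def min_scan_bandwidth_mb_per_s(volume_tb, interval_days=7):
    """MB/s required to read volume_tb terabytes within interval_days.

    Assumes the scanner reads continuously at the configured limit; real
    scans also spend time on checksumming and pauses between containers,
    so treat the result as a lower bound.
    """
    total_mb = volume_tb * 1024 * 1024   # TB -> MB (binary units)
    seconds = interval_days * 24 * 3600  # days -> seconds
    return total_mb / seconds

# At the 5 MB/s default, a 7-day iteration covers only about 2.9 TB,
# so a 16 TB volume needs roughly 28 MB/s to finish on time.
```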



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
