[
https://issues.apache.org/jira/browse/HDDS-15009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang reassigned HDDS-15009:
--------------------------------------
Assignee: Wei-Chiu Chuang
> [Docs] Administrator guide: Ozone Container Scanner
> ---------------------------------------------------
>
> Key: HDDS-15009
> URL: https://issues.apache.org/jira/browse/HDDS-15009
> Project: Apache Ozone
> Issue Type: Task
> Components: documentation
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Priority: Major
>
> Ozone Container Scanner: Administrator's Guide
> The Container Scanner is a critical background service running on Ozone
> Datanodes that ensures data integrity by proactively detecting "bit rot",
> i.e. silent data corruption where data becomes unreadable or modified without
> an immediate I/O error. This guide explains how it works, how it handles
> errors, and how to tune it for your cluster.
> 1. Overview
> Storage media can occasionally suffer from silent corruption where data
> becomes unreadable or modified without throwing an immediate I/O error. The
> Container Scanner periodically verifies all containers stored on a Datanode
> to identify such issues before the data is requested by a client.
> Why it is useful:
> * Proactive Detection: It identifies corrupt replicas early, allowing
> Ozone to recover data from healthy replicas while they are still available.
> * Data Reliability: By ensuring that all replicas are healthy, it
> maintains the desired replication factor and prevents data loss.
> * Automated Recovery: It integrates with the Storage Container Manager
> (SCM) to trigger automatic replication of corrupt containers.
> ---
> 2. Scanner Types
> The system employs three specialized scanners to balance thoroughness with
> system performance:
> * Background Metadata Scanner: A lightweight scanner that verifies the
> integrity of container metadata and the internal database. It runs in a single
> thread across all volumes on a Datanode.
> * Background Data Scanner: A more intensive scanner that reads all data
> within a container and verifies it against stored checksums. It runs one
> thread per volume and is heavily throttled.
> * On-Demand Scanner: Triggered automatically when a container is first
> opened or if corruption is suspected during normal operations.
> ---
> 3. Error Handling and Volume Health
> When a scanner detects corruption in a container, the following sequence
> occurs:
> 1. Marked UNHEALTHY: The Datanode marks the container as UNHEALTHY.
> 2. SCM Notification: The next heartbeat reports this state to SCM,
> triggering automatic re-replication from healthy copies.
> 3. Volume Scan: Detecting corruption may indicate a failing disk. The
> Datanode automatically triggers a Volume Scan on the underlying disk.
> The Volume Scanner
> The Volume Scanner (part of StorageVolumeChecker) checks the physical
> health of the disk:
> * Periodic Check: Runs every 60 minutes by default.
> * Mechanism: Performs small I/O tests (reads/writes) to verify disk
> responsiveness.
> * Failure: If the disk fails these tests, the entire volume is marked as
> FAILED, and all its containers are reported as lost to SCM for recovery.
> When does a Datanode shut down?
> A Datanode will shut itself down if the number of failed volumes exceeds
> the configured tolerance threshold. This prevents a "zombie" Datanode with no
> functional storage from remaining part of the cluster.
> By default, Ozone requires at least one functional volume of each type
> (Data, Metadata, and DB). If all volumes of a specific type fail, the Datanode
> triggers a fatal shutdown.
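> The shutdown rule above can be sketched as follows (function and parameter
> names are hypothetical, for illustration only; the actual Datanode logic
> lives inside Ozone's volume-management code):

```python
def datanode_should_shut_down(failed: int, tolerated: int, healthy: int) -> bool:
    """Sketch of the per-volume-type shutdown rule described above.

    tolerated == -1 means "unlimited failures tolerated", but at least one
    functional volume of each type (Data, Metadata, DB) is still required.
    """
    if tolerated >= 0 and failed > tolerated:
        return True       # explicit tolerance threshold exceeded
    return healthy == 0   # no functional volume of this type remains
```

> The Datanode evaluates this per volume type, so losing every DB volume is
> fatal even when data volumes are still healthy.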
> ---
> 4. Configuration and Tuning
> Container Scanner Configurations
> Prefix: hdds.container.scrub.
>
> ┌─────────────────────────┬─────────┬─────────────────────────────────────────────────────┐
> │ Configuration Key       │ Default │ Description                                         │
> ├─────────────────────────┼─────────┼─────────────────────────────────────────────────────┤
> │ enabled                 │ true    │ Enable/disable all container scanners.              │
> │ metadata.scan.interval  │ 3h      │ Interval between metadata scans.                    │
> │ data.scan.interval      │ 7d      │ Minimum interval between full data scan iterations. │
> │ volume.bytes.per.second │ 5MB/s   │ Bandwidth limit per volume for background scanning. │
> │ min.gap                 │ 15m     │ Minimum time before re-scanning the same container. │
> └─────────────────────────┴─────────┴─────────────────────────────────────────────────────┘
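> For reference, a sketch of how these keys could appear in ozone-site.xml,
> joining the hdds.container.scrub. prefix to each row. Values mirror the
> defaults in the table; the on-disk value syntax (e.g. a byte count rather
> than "5MB/s" for the bandwidth limit) may differ, so check your release's
> generated configuration reference before copying these verbatim.

```xml
<!-- Sketch only: keys = prefix + table row; value formats may need adapting. -->
<property>
  <name>hdds.container.scrub.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hdds.container.scrub.data.scan.interval</name>
  <value>7d</value>
</property>
<property>
  <name>hdds.container.scrub.volume.bytes.per.second</name>
  <value>5MB/s</value>
</property>
```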
> Datanode Volume Failure Configurations
> Prefix: hdds.datanode.
>
> ┌──────────────────────────────────────┬─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ Configuration Key                    │ Default │ Description                                                                                      │
> ├──────────────────────────────────────┼─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ failed.data.volumes.tolerated        │ -1      │ Data volumes allowed to fail before shutdown. -1 means "unlimited" (but needs 1 healthy volume). │
> │ failed.metadata.volumes.tolerated    │ -1      │ Metadata volumes allowed to fail before shutdown.                                                │
> │ failed.db.volumes.tolerated          │ -1      │ DB volumes allowed to fail before shutdown.                                                      │
> │ periodic.disk.check.interval.minutes │ 60      │ How often to run the background Volume Scanner.                                                  │
> │ disk.check.io.test.count             │ 3       │ Number of I/O tests to determine a disk failure.                                                 │
> │ disk.check.timeout                   │ 10m     │ Max time allowed for a single disk check.                                                        │
> └──────────────────────────────────────┴─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘
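> Likewise, a sketch of the volume failure settings as ozone-site.xml entries,
> joining the hdds.datanode. prefix to each row (values shown are the defaults;
> confirm exact key and value syntax against your release's configuration
> reference):

```xml
<!-- Sketch only: keys = prefix + table row. -->
<property>
  <name>hdds.datanode.failed.data.volumes.tolerated</name>
  <value>-1</value>
</property>
<property>
  <name>hdds.datanode.periodic.disk.check.interval.minutes</name>
  <value>60</value>
</property>
<property>
  <name>hdds.datanode.disk.check.timeout</name>
  <value>10m</value>
</property>
```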
> Tuning Tips:
> * High-Density Nodes: If disks are very large, increase
> volume.bytes.per.second to ensure the 7-day data scan interval can be met.
> * Performance: If background scanning causes I/O wait on your
> applications, lower the bandwidth limit.
> * High Availability: Adjust the "tolerated" volume counts if you prefer
> Datanodes to fail fast and be replaced rather than running in a degraded
> state.
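> The first tuning tip is simple arithmetic: at the default 5 MB/s limit, one
> scanner thread covers roughly 2.9 TiB of container data per volume in 7 days,
> so denser volumes need a higher limit to finish on time. A minimal sketch
> (helper name is hypothetical; it assumes 1 MB = 2^20 bytes):

```python
def min_scan_bandwidth_mb_s(volume_data_tib: float, interval_days: float = 7.0) -> float:
    """Bandwidth (MB/s) needed to scan a volume's container data once
    within the configured data scan interval."""
    seconds = interval_days * 24 * 3600
    total_mb = volume_data_tib * 1024 * 1024  # TiB -> MiB
    return total_mb / seconds

# At the default 5 MB/s, 7 days covers 5 * 604800 / 2**20 ~= 2.88 TiB;
# a 12 TiB volume would need roughly 20.8 MB/s to complete one iteration.
```

> If the required bandwidth is uncomfortably high, lengthening
> data.scan.interval is the alternative to raising volume.bytes.per.second.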
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]