errose28 commented on code in PR #387:
URL: https://github.com/apache/ozone-site/pull/387#discussion_r3081361715
########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. 
| +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). | + +Developer-only toggles `hdds.container.scrub.dev.metadata.scan.enabled` and `hdds.container.scrub.dev.data.scan.enabled` can turn off background metadata/data scanning for testing; do not use that in production clusters. + +**Reference implementations** in [`apache/ozone`](https://github.com/apache/ozone) (`hadoop-hdds/container-service/.../ozoneimpl/`): [`AbstractBackgroundContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/AbstractBackgroundContainerScanner.java), [`BackgroundContainerMetadataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java), [`BackgroundContainerDataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerDataScanner.java), [`OnDemandContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerScanner.java). Review Comment: We should not be pointing admins/operators at code. They should be able to run Ozone without it. Save this for the System Internals section when we start working on it. 
########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. 
| +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). | + +Developer-only toggles `hdds.container.scrub.dev.metadata.scan.enabled` and `hdds.container.scrub.dev.data.scan.enabled` can turn off background metadata/data scanning for testing; do not use that in production clusters. + +**Reference implementations** in [`apache/ozone`](https://github.com/apache/ozone) (`hadoop-hdds/container-service/.../ozoneimpl/`): [`AbstractBackgroundContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/AbstractBackgroundContainerScanner.java), [`BackgroundContainerMetadataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java), [`BackgroundContainerDataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerDataScanner.java), [`OnDemandContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerScanner.java). + +## Error handling and volume health + +When a scanner finds corruption in a container: + +1. **Mark unhealthy** — The Datanode marks the container **UNHEALTHY**. +2. **Tell SCM** — The next **heartbeat** carries that state so SCM can treat the replica as bad and plan **re-replication** from good copies. +3. 
**Volume check** — Because corruption may indicate **failing media**, the Datanode can trigger a **volume-level health check** on the underlying disk. + +### Volume scanner (`StorageVolumeChecker`) + +The **volume scanner** (see [`StorageVolumeChecker.java`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java) in `apache/ozone`) probes **physical** volume health: + +- Runs on a fixed cadence (default **60 minutes** — `hdds.datanode.periodic.disk.check.interval.minutes`). +- Performs small **read/write** checks to see whether the volume responds reliably. +- If the volume fails, it is marked **FAILED**; containers on it are treated as **lost** to SCM so the cluster can recover elsewhere. + +Related keys: `hdds.datanode.disk.check.io.test.count`, `hdds.datanode.disk.check.io.failures.tolerated`, `hdds.datanode.disk.check.timeout`, `hdds.datanode.disk.check.min.gap` ([appendix](../appendix)). Setting `hdds.datanode.disk.check.io.test.count` to **0** disables disk I/O checks. + +### When a Datanode exits + +If **too many volumes fail**, the Datanode **stops** rather than staying in the cluster with no usable storage. Thresholds are per **volume category** (data, metadata, DB): + +- **`hdds.datanode.failed.data.volumes.tolerated`** +- **`hdds.datanode.failed.metadata.volumes.tolerated`** +- **`hdds.datanode.failed.db.volumes.tolerated`** + +Default **`-1`** means “no fixed cap” in that dimension, but Ozone still expects **at least one healthy volume of each type** the node uses. If **all** volumes of a required type fail, the Datanode treats that as **fatal** and shuts down. + +For Datanode volume and directory layout, see `hdds.datanode.dir`, `hdds.datanode.container.db.dir`, and related keys in the [configuration appendix](../appendix). 
+ +## Configuration reference + +### Container scrub (`hdds.container.scrub.*`) + +| Key | Default | Description | +| --- | --- | --- | +| `hdds.container.scrub.enabled` | `true` | Master switch for container scanners. | +| `hdds.container.scrub.metadata.scan.interval` | `3h` | Time between **metadata** scan passes. | Review Comment: ```suggestion | `hdds.container.scrub.metadata.scan.interval` | `3h` | Minimum time between starting metadata scan passes. If a scan takes longer than this, the next scan will begin as soon as the current one finishes. | ``` ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. 
+- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. | +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). | + +Developer-only toggles `hdds.container.scrub.dev.metadata.scan.enabled` and `hdds.container.scrub.dev.data.scan.enabled` can turn off background metadata/data scanning for testing; do not use that in production clusters. 
+ +**Reference implementations** in [`apache/ozone`](https://github.com/apache/ozone) (`hadoop-hdds/container-service/.../ozoneimpl/`): [`AbstractBackgroundContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/AbstractBackgroundContainerScanner.java), [`BackgroundContainerMetadataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java), [`BackgroundContainerDataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerDataScanner.java), [`OnDemandContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerScanner.java). + +## Error handling and volume health + +When a scanner finds corruption in a container: + +1. **Mark unhealthy** — The Datanode marks the container **UNHEALTHY**. +2. **Tell SCM** — The next **heartbeat** carries that state so SCM can treat the replica as bad and plan **re-replication** from good copies. +3. **Volume check** — Because corruption may indicate **failing media**, the Datanode can trigger a **volume-level health check** on the underlying disk. + +### Volume scanner (`StorageVolumeChecker`) + +The **volume scanner** (see [`StorageVolumeChecker.java`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java) in `apache/ozone`) probes **physical** volume health: + +- Runs on a fixed cadence (default **60 minutes** — `hdds.datanode.periodic.disk.check.interval.minutes`). +- Performs small **read/write** checks to see whether the volume responds reliably. 
+- If the volume fails, it is marked **FAILED**; containers on it are treated as **lost** to SCM so the cluster can recover elsewhere. + +Related keys: `hdds.datanode.disk.check.io.test.count`, `hdds.datanode.disk.check.io.failures.tolerated`, `hdds.datanode.disk.check.timeout`, `hdds.datanode.disk.check.min.gap` ([appendix](../appendix)). Setting `hdds.datanode.disk.check.io.test.count` to **0** disables disk I/O checks. + +### When a Datanode exits + +If **too many volumes fail**, the Datanode **stops** rather than staying in the cluster with no usable storage. Thresholds are per **volume category** (data, metadata, DB): + +- **`hdds.datanode.failed.data.volumes.tolerated`** +- **`hdds.datanode.failed.metadata.volumes.tolerated`** +- **`hdds.datanode.failed.db.volumes.tolerated`** + +Default **`-1`** means “no fixed cap” in that dimension, but Ozone still expects **at least one healthy volume of each type** the node uses. If **all** volumes of a required type fail, the Datanode treats that as **fatal** and shuts down. + +For Datanode volume and directory layout, see `hdds.datanode.dir`, `hdds.datanode.container.db.dir`, and related keys in the [configuration appendix](../appendix). + +## Configuration reference + +### Container scrub (`hdds.container.scrub.*`) + +| Key | Default | Description | +| --- | --- | --- | +| `hdds.container.scrub.enabled` | `true` | Master switch for container scanners. | +| `hdds.container.scrub.metadata.scan.interval` | `3h` | Time between **metadata** scan passes. | +| `hdds.container.scrub.data.scan.interval` | `7d` | Minimum time between **full data** scan **iterations** (if a pass finishes sooner, the scanner waits). | +| `hdds.container.scrub.volume.bytes.per.second` | `5242880` (~**5 MiB/s**) | Per-volume **bandwidth cap** for **background** data scanning. | +| `hdds.container.scrub.min.gap` | `15m` | Minimum time before the **same** container is scanned again. 
| + +### Datanode volume failure and disk checks (`hdds.datanode.*`) + +| Key | Default | Description | +| --- | --- | --- | +| `hdds.datanode.failed.data.volumes.tolerated` | `-1` | Data volumes that may fail before the Datanode stops (`-1` = unlimited count, but at least one good volume must remain). | +| `hdds.datanode.failed.metadata.volumes.tolerated` | `-1` | Same for **metadata** volumes. | +| `hdds.datanode.failed.db.volumes.tolerated` | `-1` | Same for **RocksDB** volumes. | +| `hdds.datanode.periodic.disk.check.interval.minutes` | `60` | Interval for **volume scanner** runs. | +| `hdds.datanode.disk.check.io.test.count` | `3` | Number of recent I/O tests used to judge disk health. | +| `hdds.datanode.disk.check.timeout` | `10m` | Max time for one disk check before the disk is considered failed. | + +## Tuning tips + +- **Large disks** — If a full data pass cannot finish within your `data.scan.interval` at the default throttle, **raise** `hdds.container.scrub.volume.bytes.per.second` cautiously. Review Comment: ```suggestion ::: note The background container data scanner can potentially take weeks to scan all container data on a volume. This rate is expected in order to reserve disk bandwidth for foreground workloads. ::: - **Large disks** — If a full data pass cannot finish within your desired `data.scan.interval` at the default throttle, **raise** `hdds.container.scrub.volume.bytes.per.second` cautiously to avoid taking too much disk IO from foreground workloads. ``` ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. 
This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. | +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). 
| Review Comment: ```suggestion | **On-demand data scanner** | Runs metadata and data scans when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second`). | ``` ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). 
+ +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. | +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). | + +Developer-only toggles `hdds.container.scrub.dev.metadata.scan.enabled` and `hdds.container.scrub.dev.data.scan.enabled` can turn off background metadata/data scanning for testing; do not use that in production clusters. Review Comment: I don't think we should mention dev config keys in this page at all so people are not tempted to mess with them. They will be in the global config appendix page where low level configs that are not essential for cluster operations belong. ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). 
Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. | +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). | + +Developer-only toggles `hdds.container.scrub.dev.metadata.scan.enabled` and `hdds.container.scrub.dev.data.scan.enabled` can turn off background metadata/data scanning for testing; do not use that in production clusters. 
+ +**Reference implementations** in [`apache/ozone`](https://github.com/apache/ozone) (`hadoop-hdds/container-service/.../ozoneimpl/`): [`AbstractBackgroundContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/AbstractBackgroundContainerScanner.java), [`BackgroundContainerMetadataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java), [`BackgroundContainerDataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerDataScanner.java), [`OnDemandContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerScanner.java). + +## Error handling and volume health + +When a scanner finds corruption in a container: + +1. **Mark unhealthy** — The Datanode marks the container **UNHEALTHY**. +2. **Tell SCM** — The next **heartbeat** carries that state so SCM can treat the replica as bad and plan **re-replication** from good copies. +3. **Volume check** — Because corruption may indicate **failing media**, the Datanode can trigger a **volume-level health check** on the underlying disk. + +### Volume scanner (`StorageVolumeChecker`) + +The **volume scanner** (see [`StorageVolumeChecker.java`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java) in `apache/ozone`) probes **physical** volume health: + +- Runs on a fixed cadence (default **60 minutes** — `hdds.datanode.periodic.disk.check.interval.minutes`). +- Performs small **read/write** checks to see whether the volume responds reliably. 
+- If the volume fails, it is marked **FAILED**; containers on it are treated as **lost** to SCM so the cluster can recover elsewhere. + +Related keys: `hdds.datanode.disk.check.io.test.count`, `hdds.datanode.disk.check.io.failures.tolerated`, `hdds.datanode.disk.check.timeout`, `hdds.datanode.disk.check.min.gap` ([appendix](../appendix)). Setting `hdds.datanode.disk.check.io.test.count` to **0** disables disk I/O checks. + +### When a Datanode exits + +If **too many volumes fail**, the Datanode **stops** rather than staying in the cluster with no usable storage. Thresholds are per **volume category** (data, metadata, DB): + +- **`hdds.datanode.failed.data.volumes.tolerated`** +- **`hdds.datanode.failed.metadata.volumes.tolerated`** +- **`hdds.datanode.failed.db.volumes.tolerated`** + +Default **`-1`** means “no fixed cap” in that dimension, but Ozone still expects **at least one healthy volume of each type** the node uses. If **all** volumes of a required type fail, the Datanode treats that as **fatal** and shuts down. + +For Datanode volume and directory layout, see `hdds.datanode.dir`, `hdds.datanode.container.db.dir`, and related keys in the [configuration appendix](../appendix). + +## Configuration reference + +### Container scrub (`hdds.container.scrub.*`) + +| Key | Default | Description | +| --- | --- | --- | +| `hdds.container.scrub.enabled` | `true` | Master switch for container scanners. | +| `hdds.container.scrub.metadata.scan.interval` | `3h` | Time between **metadata** scan passes. | +| `hdds.container.scrub.data.scan.interval` | `7d` | Minimum time between **full data** scan **iterations** (if a pass finishes sooner, the scanner waits). | Review Comment: ```suggestion | `hdds.container.scrub.data.scan.interval` | `7d` | Minimum time between starting full container data scans of the same volume. If a scan takes longer than this, the next scan will begin as soon as the current one finishes. 
| ``` ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). Review Comment: ```suggestion - **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication manager can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). 
``` ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. 
| +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). | + +Developer-only toggles `hdds.container.scrub.dev.metadata.scan.enabled` and `hdds.container.scrub.dev.data.scan.enabled` can turn off background metadata/data scanning for testing; do not use that in production clusters. + +**Reference implementations** in [`apache/ozone`](https://github.com/apache/ozone) (`hadoop-hdds/container-service/.../ozoneimpl/`): [`AbstractBackgroundContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/AbstractBackgroundContainerScanner.java), [`BackgroundContainerMetadataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java), [`BackgroundContainerDataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerDataScanner.java), [`OnDemandContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerScanner.java). + +## Error handling and volume health + +When a scanner finds corruption in a container: + +1. **Mark unhealthy** — The Datanode marks the container **UNHEALTHY**. +2. **Tell SCM** — The next **heartbeat** carries that state so SCM can treat the replica as bad and plan **re-replication** from good copies. +3. 
**Volume check** — Because corruption may indicate **failing media**, the Datanode can trigger a **volume-level health check** on the underlying disk. + +### Volume scanner (`StorageVolumeChecker`) Review Comment: This should be its own page. It does not check containers so it is confusing to have it in a page with "Containers" in the title. There is a lot more detail that can be provided as to how we determine a volume is failed and a separate config reference so we don't need to inline config keys into the text explanations. ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. 
+- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. | +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | +| **On-demand scanner** | Runs when a container is **opened** or when corruption is **suspected** during normal I/O. | Uses its own throttle (`hdds.container.scrub.on.demand.volume.bytes.per.second` in the [appendix](../appendix)). | + +Developer-only toggles `hdds.container.scrub.dev.metadata.scan.enabled` and `hdds.container.scrub.dev.data.scan.enabled` can turn off background metadata/data scanning for testing; do not use that in production clusters. 
+ +**Reference implementations** in [`apache/ozone`](https://github.com/apache/ozone) (`hadoop-hdds/container-service/.../ozoneimpl/`): [`AbstractBackgroundContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/AbstractBackgroundContainerScanner.java), [`BackgroundContainerMetadataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerMetadataScanner.java), [`BackgroundContainerDataScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/BackgroundContainerDataScanner.java), [`OnDemandContainerScanner`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OnDemandContainerScanner.java). + +## Error handling and volume health + +When a scanner finds corruption in a container: + +1. **Mark unhealthy** — The Datanode marks the container **UNHEALTHY**. +2. **Tell SCM** — The next **heartbeat** carries that state so SCM can treat the replica as bad and plan **re-replication** from good copies. +3. **Volume check** — Because corruption may indicate **failing media**, the Datanode can trigger a **volume-level health check** on the underlying disk. + +### Volume scanner (`StorageVolumeChecker`) + +The **volume scanner** (see [`StorageVolumeChecker.java`](https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolumeChecker.java) in `apache/ozone`) probes **physical** volume health: + +- Runs on a fixed cadence (default **60 minutes** — `hdds.datanode.periodic.disk.check.interval.minutes`). +- Performs small **read/write** checks to see whether the volume responds reliably. 
+- If the volume fails, it is marked **FAILED**; containers on it are treated as **lost** to SCM so the cluster can recover elsewhere. + +Related keys: `hdds.datanode.disk.check.io.test.count`, `hdds.datanode.disk.check.io.failures.tolerated`, `hdds.datanode.disk.check.timeout`, `hdds.datanode.disk.check.min.gap` ([appendix](../appendix)). Setting `hdds.datanode.disk.check.io.test.count` to **0** disables disk I/O checks. + +### When a Datanode exits + +If **too many volumes fail**, the Datanode **stops** rather than staying in the cluster with no usable storage. Thresholds are per **volume category** (data, metadata, DB): + +- **`hdds.datanode.failed.data.volumes.tolerated`** +- **`hdds.datanode.failed.metadata.volumes.tolerated`** +- **`hdds.datanode.failed.db.volumes.tolerated`** + +Default **`-1`** means “no fixed cap” in that dimension, but Ozone still expects **at least one healthy volume of each type** the node uses. If **all** volumes of a required type fail, the Datanode treats that as **fatal** and shuts down. + +For Datanode volume and directory layout, see `hdds.datanode.dir`, `hdds.datanode.container.db.dir`, and related keys in the [configuration appendix](../appendix). + +## Configuration reference + +### Container scrub (`hdds.container.scrub.*`) + +| Key | Default | Description | Review Comment: This is missing `hdds.container.scrub.on.demand.volume.bytes.per.second`. ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. 
This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | Review Comment: The "Footprint" section seems like too much of a catch-all. Probably two columns: one for number of threads and one for bandwidth throttle would be clearer. Also to keep the table concise we should probably not put config keys in it and leave those for the config reference section at the end. 
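To illustrate why thread count and bandwidth throttle read better as separate columns, here is a minimal sketch of what one per-volume scanner thread does conceptually: verify each chunk against its stored checksum while capping the read rate. This is not Ozone's implementation (the real scanners are the Java classes linked in the diff). The function `scan_chunks`, its parameters, and the use of CRC32 are all hypothetical choices made for illustration.

```python
import time
import zlib
from typing import Callable, Iterable, List, Tuple


def scan_chunks(
    chunks: Iterable[Tuple[bytes, int]],
    bytes_per_second: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Return the indexes of chunks whose CRC32 differs from the stored value.

    Hypothetical sketch: each item in `chunks` is (data, stored_crc32).
    Reads are paced so the average rate stays at or below `bytes_per_second`.
    """
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")

    corrupt: List[int] = []
    start = clock()
    bytes_read = 0
    for index, (data, stored_crc) in enumerate(chunks):
        # Checksum validation: recompute and compare with the stored value.
        if zlib.crc32(data) != stored_crc:
            corrupt.append(index)
        bytes_read += len(data)
        # Throttle: the earliest time at which this many bytes is allowed.
        earliest = start + bytes_read / bytes_per_second
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return corrupt
```

In this sketch, running one such loop per volume corresponds to the "threads" column, and `bytes_per_second` corresponds to the "bandwidth throttle" column. The `clock` and `sleep` parameters are injectable only so the pacing can be exercised without real waiting.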
########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. + +For authoritative defaults and descriptions, use the [configuration appendix](../appendix) (search for `hdds.container.scrub` and `hdds.datanode`). Main scanner implementations in [`apache/ozone`](https://github.com/apache/ozone) are linked under [Scanner types](#scanner-types) below. + +## Overview + +Storage can corrupt data **without** an immediate I/O error. The container scanner **periodically validates** containers on each Datanode so problems surface **early**, while healthy replicas may still exist. + +**Why it matters** + +- **Early detection** — Bad replicas are flagged before a client depends on them. +- **Replication health** — SCM can keep the intended replica count by treating unhealthy copies as needing repair. +- **Automated recovery** — The Datanode reports container state to SCM; SCM’s replication machinery can schedule work using healthy copies (see also the [replication manager report](../../operations/container-replication-report)). + +## Scanner types + +Three paths balance coverage and cost: + +| Type | What it does | Footprint | +| --- | --- | --- | +| **Background metadata scanner** | Validates **container metadata** and internal metadata structures. | One thread **across all volumes** on the Datanode; relatively light. 
| +| **Background data scanner** | Reads **payload** data and checks it against **stored checksums**. | **One thread per volume**, heavily **throttled** (bandwidth limit). | Review Comment: ```suggestion | **Background data scanner** | Performs checksum validation on every byte of data on the volume | **One thread per volume**, with a read bandwidth limit to avoid interfering with workloads. | ``` ########## docs/05-administrator-guide/02-configuration/08-fault-tolerance/01-container-scanner.md: ########## @@ -0,0 +1,98 @@ +--- +sidebar_label: Container Scanner +--- + +# Ozone container scanner + +The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) before clients read them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. Review Comment: The original wording makes it sound like scanning containers is a pre-requisite to being able to read from them. ```suggestion The **container scanner** is a Datanode background service that helps protect against **silent data corruption** (“bit rot”) by verifying [storage containers](../../../core-concepts/replication/storage-containers) even when clients are not reading them. This page summarizes how it works, how failures propagate to [Storage Container Manager (SCM)](../../../core-concepts/architecture/storage-container-manager), and which settings operators tune. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
