adoroszlai opened a new pull request #2352:
URL: https://github.com/apache/ozone/pull/2352
## What changes were proposed in this pull request?
After HDDS-5268, disk check is timing out although all volumes are healthy:
```
2021-06-21 09:35:49,202 [Periodic HDDS volume checker] INFO
volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
2021-06-21 09:35:49,204 [Periodic HDDS volume checker] INFO
volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
2021-06-21 09:35:49,205 [DataNode DiskChecker thread 1] DEBUG
volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
2021-06-21 09:36:19,169 [Periodic HDDS volume checker] WARN
volume.StorageVolumeChecker: checkAllVolumes timed out after 30000 ms
2021-06-21 09:36:19,169 [Periodic HDDS volume checker] WARN
volume.MutableVolumeSet: checkAllVolumes got 1 failed volumes -
[/data/hdds/hdds]
2021-06-21 09:36:19,172 [Periodic HDDS volume checker] INFO
volume.MutableVolumeSet: Moving Volume : /data/hdds/hdds to failed Volumes
2021-06-21 09:36:19,172 [Periodic HDDS volume checker] ERROR
statemachine.DatanodeStateMachine: DatanodeStateMachine Shutdown due to too
many bad volumes, check hdds.datanode.failed.data.volumes.tolerated and
hdds.datanode.failed.metadata.volumes.tolerated
```
This can be reproduced simply by starting up the cluster and waiting long
enough for the disk checker to kick in and then time out.
The problem is due to locking:
* `Periodic HDDS volume checker` thread holds the lock on
`StorageVolumeChecker` (acquired in `checkAllVolumeSets()`), and calls
`latch.await()`
* `DataNode DiskChecker thread 1` waits for the lock in
`ResultHandler.markHealthy()`, hence does not reach the code that would
`countDown()` the latch
* `Periodic HDDS volume checker` proceeds after the timeout period
Since each `checkAllVolumes()` call uses its own `healthyVolumes` and
`failedVolumes` sets, these don't need to be guarded by the lock on
`StorageVolumeChecker`. This way the deadlock can be avoided.
https://issues.apache.org/jira/browse/HDDS-5363
## How was this patch tested?
Tested locally on Docker Compose cluster with short disk checker interval:
```
OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=1m
OZONE-SITE.XML_hdds.datanode.disk.check.timeout=30s
OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=2
```
Disk checker did not time out and did not cause datanode shutdown:
```
datanode_2 | 2021-06-21 10:15:00,071 [Periodic HDDS volume checker] INFO
volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
datanode_2 | 2021-06-21 10:15:00,073 [Periodic HDDS volume checker] INFO
volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
datanode_2 | 2021-06-21 10:15:00,074 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
datanode_2 | 2021-06-21 10:15:00,076 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
datanode_2 | 2021-06-21 10:15:00,076 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_2 | 2021-06-21 10:15:00,077 [Periodic HDDS volume checker] TRACE
volume.StorageVolumeChecker: Skipped checking all volumes, time since last
check 7 is less than the minimum gap between checks (60000 ms).
datanode_2 | 2021-06-21 10:15:00,077 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_3 | 2021-06-21 10:15:00,277 [Periodic HDDS volume checker] INFO
volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
datanode_3 | 2021-06-21 10:15:00,278 [Periodic HDDS volume checker] INFO
volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
datanode_3 | 2021-06-21 10:15:00,280 [DataNode DiskChecker thread 1] DEBUG
volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
datanode_3 | 2021-06-21 10:15:00,280 [DataNode DiskChecker thread 1] DEBUG
volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
datanode_3 | 2021-06-21 10:15:00,281 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_3 | 2021-06-21 10:15:00,282 [Periodic HDDS volume checker] TRACE
volume.StorageVolumeChecker: Skipped checking all volumes, time since last
check 6 is less than the minimum gap between checks (60000 ms).
datanode_3 | 2021-06-21 10:15:00,282 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_1 | 2021-06-21 10:15:00,432 [Periodic HDDS volume checker] INFO
volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
datanode_1 | 2021-06-21 10:15:00,435 [Periodic HDDS volume checker] INFO
volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
datanode_1 | 2021-06-21 10:15:00,436 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
datanode_1 | 2021-06-21 10:15:00,436 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
datanode_1 | 2021-06-21 10:15:00,436 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_1 | 2021-06-21 10:15:00,438 [Periodic HDDS volume checker] TRACE
volume.StorageVolumeChecker: Skipped checking all volumes, time since last
check 6 is less than the minimum gap between checks (60000 ms).
datanode_1 | 2021-06-21 10:15:00,438 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_2 | 2021-06-21 10:16:59,939 [Periodic HDDS volume checker] INFO
volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
datanode_2 | 2021-06-21 10:16:59,941 [Periodic HDDS volume checker] INFO
volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
datanode_2 | 2021-06-21 10:16:59,942 [DataNode DiskChecker thread 2] DEBUG
volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
datanode_2 | 2021-06-21 10:16:59,942 [DataNode DiskChecker thread 2] DEBUG
volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
datanode_2 | 2021-06-21 10:16:59,943 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_2 | 2021-06-21 10:16:59,943 [Periodic HDDS volume checker] TRACE
volume.StorageVolumeChecker: Skipped checking all volumes, time since last
check 4 is less than the minimum gap between checks (60000 ms).
datanode_2 | 2021-06-21 10:16:59,943 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_3 | 2021-06-21 10:17:00,145 [Periodic HDDS volume checker] INFO
volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
datanode_3 | 2021-06-21 10:17:00,151 [Periodic HDDS volume checker] INFO
volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
datanode_3 | 2021-06-21 10:17:00,152 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
datanode_3 | 2021-06-21 10:17:00,152 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
datanode_3 | 2021-06-21 10:17:00,153 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_3 | 2021-06-21 10:17:00,153 [Periodic HDDS volume checker] TRACE
volume.StorageVolumeChecker: Skipped checking all volumes, time since last
check 8 is less than the minimum gap between checks (60000 ms).
datanode_3 | 2021-06-21 10:17:00,154 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_1 | 2021-06-21 10:17:00,300 [Periodic HDDS volume checker] INFO
volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
datanode_1 | 2021-06-21 10:17:00,305 [Periodic HDDS volume checker] INFO
volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
datanode_1 | 2021-06-21 10:17:00,307 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
datanode_1 | 2021-06-21 10:17:00,308 [Periodic HDDS volume checker] DEBUG
volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
datanode_1 | 2021-06-21 10:17:00,310 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
datanode_1 | 2021-06-21 10:17:00,315 [Periodic HDDS volume checker] TRACE
volume.StorageVolumeChecker: Skipped checking all volumes, time since last
check 15 is less than the minimum gap between checks (60000 ms).
datanode_1 | 2021-06-21 10:17:00,315 [Periodic HDDS volume checker] DEBUG
volume.MutableVolumeSet: checkAllVolumes encountered no failures
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]