balodesecurity opened a new pull request, #8326:
URL: https://github.com/apache/hadoop/pull/8326
## Summary
The **Nodes in Service** count displayed per storage type in the NameNode UI
(DFS Storage Types section) could drift far below the real value, e.g. showing
3 nodes when the cluster has 26.
### Root Cause
`DatanodeStats.StorageTypeStatsMap` maintains a `StorageTypeStats` entry per
storage type with an incremental `nodesInService` counter. The map entry was
removed whenever `nodesInService` dropped to 0 — even when
decommissioning/maintenance nodes still used the same storage type.
The premature removal caused a cascade:
1. Node A (last in-service node for DISK) starts decommissioning →
`nodesInService` drops to 0 → **DISK entry removed**.
2. The next heartbeat from any node recreates the entry from scratch
(`nodesInService = 0`).
3. When in-service node B heartbeats, `subtract(B)` runs against the fresh
entry, so `nodesInService` goes 0→-1; `add(B)` then only brings it back to
-1→0. **B's in-service contribution is lost**.
4. After enough such cycles the reported count is far below the real number.
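The counter arithmetic in steps 1-3 can be replayed with a small toy model. This is a minimal sketch, not the Hadoop code: `RemoveAtZeroDemo`, `Node`, `Stats`, and `replay()` are hypothetical names, only one storage type is modeled, and a heartbeat is reduced to `subtract` followed by `add`. It takes step 3's premise as given (B is in service but is not reflected in the recreated entry) and does not try to reproduce how a real NameNode reaches that state.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: NOT the Hadoop classes, just the counter logic from the steps above.
class RemoveAtZeroDemo {
  enum StorageType { DISK }
  enum AdminState { IN_SERVICE, DECOMMISSIONING }

  static final class Node {
    AdminState state;
    Node(AdminState state) { this.state = state; }
    boolean isInService() { return state == AdminState.IN_SERVICE; }
  }

  static final class Stats { int nodesInService; }

  private final Map<StorageType, Stats> statsMap = new EnumMap<>(StorageType.class);

  void add(Node n) {
    // A missing entry is recreated with all counters at zero.
    Stats s = statsMap.computeIfAbsent(StorageType.DISK, t -> new Stats());
    if (n.isInService()) s.nodesInService++;
  }

  void subtract(Node n) {
    Stats s = statsMap.get(StorageType.DISK);
    if (s == null) return;
    if (n.isInService()) s.nodesInService--;
    // Old rule: drop the entry as soon as no in-service node is counted.
    if (s.nodesInService == 0) statsMap.remove(StorageType.DISK);
  }

  boolean hasEntry() { return statsMap.containsKey(StorageType.DISK); }

  int nodesInService() {
    Stats s = statsMap.get(StorageType.DISK);
    return s == null ? 0 : s.nodesInService;
  }

  // Returns {1 if the entry exists after step 1 else 0, nodesInService after step 3}.
  static int[] replay() {
    RemoveAtZeroDemo stats = new RemoveAtZeroDemo();
    Node a = new Node(AdminState.IN_SERVICE);
    Node b = new Node(AdminState.IN_SERVICE);
    stats.add(a);                       // nodesInService = 1

    // Step 1: A starts decommissioning; its in-service count is subtracted first.
    stats.subtract(a);                  // 1 -> 0, entry removed
    int presentAfterStep1 = stats.hasEntry() ? 1 : 0;
    a.state = AdminState.DECOMMISSIONING;

    // Step 2: the next add recreates a fresh entry (A no longer counts as in service).
    stats.add(a);                       // fresh entry, nodesInService = 0

    // Step 3 (premise from the description): B is in service but not in the fresh entry.
    stats.subtract(b);                  // 0 -> -1, entry kept because -1 != 0
    stats.add(b);                       // -1 -> 0
    return new int[] { presentAfterStep1, stats.nodesInService() };
  }

  public static void main(String[] args) {
    int[] r = replay();
    System.out.println("entry present after step 1: " + (r[0] == 1));
    System.out.println("nodesInService after step 3: " + r[1] + " (B is in service)");
  }
}
```

In this model the entry is gone after step 1 even though A still uses DISK, and the replay ends with a reported count of 0 while B is in service.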
### Fix
Add a `totalNodes` counter to `StorageTypeStats` that tracks **all** nodes
using a storage type (in-service + decommissioning + maintenance). Change the
map-entry removal condition from `nodesInService == 0` to `totalNodes == 0`. An
entry is now only removed when no node of any admin state still uses that
storage type.
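The changed removal rule can be sketched the same way. Again this is a toy model under stated assumptions rather than the patch itself: `TotalNodesDemo`, `Node`, `Stats`, and `scenario()` are hypothetical names, only the `totalNodes` counter and the `totalNodes == 0` condition come from the description above, and maintenance states are left out for brevity.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: NOT the Hadoop classes. Illustrates removal on totalNodes == 0.
class TotalNodesDemo {
  enum StorageType { DISK }
  enum AdminState { IN_SERVICE, DECOMMISSIONING }

  static final class Node {
    AdminState state;
    Node(AdminState state) { this.state = state; }
    boolean isInService() { return state == AdminState.IN_SERVICE; }
  }

  static final class Stats {
    int nodesInService;
    int totalNodes; // every node using this storage type, whatever its admin state

    void addNode(Node n) {
      totalNodes++;
      if (n.isInService()) nodesInService++;
    }

    void subtractNode(Node n) {
      totalNodes--;
      if (n.isInService()) nodesInService--;
    }
  }

  private final Map<StorageType, Stats> statsMap = new EnumMap<>(StorageType.class);

  void add(Node n) {
    statsMap.computeIfAbsent(StorageType.DISK, t -> new Stats()).addNode(n);
  }

  void subtract(Node n) {
    Stats s = statsMap.get(StorageType.DISK);
    if (s == null) return;
    s.subtractNode(n);
    // New rule: remove the entry only when no node of any state still uses the type.
    if (s.totalNodes == 0) statsMap.remove(StorageType.DISK);
  }

  boolean hasEntry() { return statsMap.containsKey(StorageType.DISK); }

  // Returns {entry present after A decommissions, nodesInService, totalNodes,
  //          entry present after all nodes are gone}, with 1/0 for the booleans.
  static int[] scenario() {
    TotalNodesDemo stats = new TotalNodesDemo();
    Node a = new Node(AdminState.IN_SERVICE);
    Node b = new Node(AdminState.DECOMMISSIONING);
    Node c = new Node(AdminState.IN_SERVICE);
    stats.add(a);
    stats.add(b);                          // nodesInService = 1, totalNodes = 2

    // The last in-service node starts decommissioning.
    stats.subtract(a);                     // nodesInService = 0, totalNodes = 1
    int survived = stats.hasEntry() ? 1 : 0;
    a.state = AdminState.DECOMMISSIONING;
    stats.add(a);                          // nodesInService = 0, totalNodes = 2

    // A new in-service node joins and then heartbeats (subtract + add).
    stats.add(c);
    stats.subtract(c);
    stats.add(c);
    Stats s = stats.statsMap.get(StorageType.DISK);
    int inService = s.nodesInService;
    int total = s.totalNodes;

    // Only once every node is gone does the entry disappear.
    stats.subtract(a);
    stats.subtract(b);
    stats.subtract(c);
    return new int[] { survived, inService, total, stats.hasEntry() ? 1 : 0 };
  }

  public static void main(String[] args) {
    int[] r = scenario();
    System.out.println("entry survives decommission: " + (r[0] == 1));
    System.out.println("nodesInService=" + r[1] + " totalNodes=" + r[2]);
    System.out.println("entry present after all nodes removed: " + (r[3] == 1));
  }
}
```

Because the entry stays in place while any node still uses the storage type, a later `subtract` always runs against the counters that the matching `add` updated.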
**Changed files:**
- `StorageTypeStats.java` — new `totalNodes` field; `addNode`/`subtractNode`
always update it; new `getTotalNodes()` accessor
- `DatanodeStats.java` — removal condition updated to `getTotalNodes() == 0`
- `TestStorageTypeStatsMap.java` — 4 new unit tests (new file)
## Test plan
- [x] `TestStorageTypeStatsMap` (4 tests) — PASS
  - `testBasicAddRemove`: basic correctness
  - `testEntryNotRemovedWhenDecommissioningNodeRemains`: entry survives
    when a decommissioning node still uses the storage type; `nodesInService`
    stays correct
  - `testEntryNotRemovedWhenLastInServiceDecommissions`: entry survives
    when the last in-service node decommissions; a new in-service node is
    counted correctly
  - `testEntryRemovedOnlyWhenAllNodesGone`: entry removed only after all
    nodes (including decommissioning ones) are gone
- [ ] Full blockmanagement test suite (CI)