Ashish Kumar created HDDS-14921:
-----------------------------------
Summary: Improve space accounting in SCM with In-Flight container
allocation tracking
Key: HDDS-14921
URL: https://issues.apache.org/jira/browse/HDDS-14921
Project: Apache Ozone
Issue Type: Improvement
Reporter: Ashish Kumar
Assignee: Ashish Kumar
The current disk space management and container allocation mechanism in SCM
(Storage Container Manager) relies heavily on periodic DataNode (DN) heartbeat
reports and on static policies, such as a pipeline count derived from the number
of disks. This approach introduces the following systemic challenges:
# *Stale Space Visibility*
SCM makes allocation decisions based on heartbeat-reported disk space, which
can lag behind actual usage. During high allocation rates, this delay leads to
inaccurate space estimation and potential over-allocation.
# *Burst Allocation Risk*
Rapid container allocations within short intervals are not accounted for
immediately, allowing multiple allocations against the same reported free
space. This can oversubscribe disks and result in sudden disk exhaustion.
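To make the burst case concrete, the sketch below replays back-to-back allocations against a single stale heartbeat value. The class name `BurstAllocationDemo` and the figures used with it (12 GB reported free, 5 GB containers) are assumptions for illustration only, not SCM code.

```java
// Hypothetical illustration, not SCM code: every allocation is checked only
// against the free space from the last heartbeat report.
final class BurstAllocationDemo {
  static final long GB = 1024L * 1024 * 1024;

  /** Counts how many requests pass when only the stale report is consulted. */
  static int admittedWithoutTracking(long reportedFree, long containerSize, int requests) {
    int admitted = 0;
    for (int i = 0; i < requests; i++) {
      // reportedFree stays unchanged until the next heartbeat arrives,
      // so earlier admissions in the same interval are invisible here.
      if (reportedFree >= containerSize) {
        admitted++;
      }
    }
    return admitted;
  }
}
```

With 12 GB reported free and three 5 GB requests between heartbeats, all three pass the check, committing 15 GB against 12 GB of actual free space.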
*Solution:*
A two-window tumbling bucket, similar to HADOOP-3707
*Two Windows Per DataNode*
Each DataNode has a TwoWindowBucket containing:
- currentWindow: Containers allocated in the current 10-minute interval
- previousWindow: Containers from the previous 10-minute interval
- lastRollTime: Timestamp of last roll
*Container Allocation Flow: New Container Allocated*
- Add ContainerID to currentWindow
- Check if roll needed (time > lastRollTime + 10min)
- If yes: previousWindow = currentWindow; currentWindow = {}
*Space check: Get Pending Allocations*
- Roll if needed
- Return UNION(currentWindow, previousWindow)
- pendingSize = union.size() × maxContainerSize
- effectiveSpace = remainingSpace - pendingSize
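The two formulas above amount to the arithmetic below. This is a minimal sketch: `EffectiveSpace` and its method names are hypothetical, and how SCM would act on the result (for example, comparing it with the size of the next container) is not stated in this issue.

```java
// Hypothetical helper mirroring the space-check formulas; not SCM code.
final class EffectiveSpace {
  /** pendingSize = union.size() x maxContainerSize */
  static long pendingSize(int pendingContainers, long maxContainerSize) {
    return pendingContainers * maxContainerSize;
  }

  /** effectiveSpace = remainingSpace - pendingSize (may go negative). */
  static long effectiveSpace(long remainingSpace, int pendingContainers, long maxContainerSize) {
    return remainingSpace - pendingSize(pendingContainers, maxContainerSize);
  }
}
```

For example, assuming 12 GB remaining, two pending containers, and a 5 GB maximum container size, pendingSize is 10 GB and effectiveSpace is 2 GB, so a third 5 GB container would no longer appear to fit.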
*Container report: Container Report Received*
- Remove ContainerID from BOTH windows
- More accurate than waiting for automatic aging
- Falls back to aging if report is delayed/missed
*Automatic Aging: Roll*
Every 10 Minutes (Triggered Lazily on Operations):
1. previousWindow = currentWindow
2. currentWindow = {} (new empty set)
3. lastRollTime = now
4. Old previousWindow is garbage collected
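Taken together, the structure, allocation flow, pending query, report handling, and lazy roll described above could look roughly like the sketch below. It is not the proposed patch: the method names, the injected `LongSupplier` clock, and the `synchronized` locking are assumptions. The sketch also rolls before adding a new ID, so that a fresh allocation always lands in currentWindow; that ordering is a choice made here rather than something the issue specifies.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongSupplier;

// Sketch of a per-DataNode two-window bucket. Names other than currentWindow,
// previousWindow, and lastRollTime are hypothetical.
final class TwoWindowBucket {
  static final long ROLL_INTERVAL_MS = 10 * 60 * 1000L; // 10-minute window

  private final LongSupplier clock; // injected so time can be controlled (assumption)
  private Set<Long> currentWindow = new HashSet<>();
  private Set<Long> previousWindow = new HashSet<>();
  private long lastRollTime;

  TwoWindowBucket(LongSupplier clock) {
    this.clock = clock;
    this.lastRollTime = clock.getAsLong();
  }

  // Lazy roll: performs at most one tumble, following the four steps above.
  private void rollIfNeeded() {
    long now = clock.getAsLong();
    if (now > lastRollTime + ROLL_INTERVAL_MS) {
      previousWindow = currentWindow;   // old previousWindow becomes garbage
      currentWindow = new HashSet<>();
      lastRollTime = now;
    }
  }

  /** New container allocated on this DataNode. */
  synchronized void recordAllocation(long containerId) {
    rollIfNeeded();
    currentWindow.add(containerId);
  }

  /** Returns UNION(currentWindow, previousWindow) as a fresh set. */
  synchronized Set<Long> getPending() {
    rollIfNeeded();
    Set<Long> union = new HashSet<>(currentWindow);
    union.addAll(previousWindow);
    return union;
  }

  /** Container report received: remove the ID from both windows. */
  synchronized void onContainerReport(long containerId) {
    currentWindow.remove(containerId);
    previousWindow.remove(containerId);
  }
}
```

Because the roll is triggered by operations and sets lastRollTime to the time of that operation, window boundaries in this sketch follow when the bucket is touched rather than fixed wall-clock marks. An unreported container therefore stays pending for at least one full interval after its allocation, and for longer if the bucket sees no operations.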
+*Timeline Example*+
Time  | Action                    | CurrentWindow | PreviousWindow | Total Pending
------+---------------------------+---------------+----------------+------------------
00:00 | Allocate Container-1      | {C1}          | {}             | {C1}
00:05 | Allocate Container-2      | {C1, C2}      | {}             | {C1, C2}
00:08 | Allocate Container-3      | {C1, C2, C3}  | {}             | {C1, C2, C3}
00:10 | [ROLL] Window tumbles     | {}            | {C1, C2, C3}   | {C1, C2, C3}
      | ⤷ previousWindow ← currentWindow
      | ⤷ currentWindow ← {} (reset)
------+---------------------------+---------------+----------------+------------------
00:12 | Allocate Container-4      | {C4}          | {C1, C2, C3}   | {C1, C2, C3, C4}
00:15 | Report confirms C1        | {C4}          | {C2, C3}       | {C2, C3, C4}
      | ⤷ Explicitly removed from previousWindow
00:18 | Allocate Container-5      | {C4, C5}      | {C2, C3}       | {C2, C3, C4, C5}
00:20 | [ROLL] Window tumbles     | {}            | {C4, C5}       | {C4, C5}
      | ⤷ C2, C3 aged out (unreported across two windows)
------+---------------------------+---------------+----------------+------------------
00:25 | Report confirms C4        | {}            | {C5}           | {C5}
00:30 | [ROLL] Window tumbles     | {}            | {}             | {}
      | ⤷ C5 aged out (unreported across two windows)