erzhen wu created HADOOP-19918:
----------------------------------
Summary: MutableQuantiles.add() causes severe lock contention,
blocking all IPC Handler threads and degrading RPC performance
Key: HADOOP-19918
URL: https://issues.apache.org/jira/browse/HADOOP-19918
Project: Hadoop Common
Issue Type: Bug
Components: hadoop-common, metrics
Affects Versions: 3.3.3
Reporter: erzhen wu
Attachments: image-2026-06-11-18-43-36-472.png,
image-2026-06-11-18-54-43-972.png, image-2026-06-11-18-55-17-240.png
When `rpc.metrics.quantile.enable=true` is enabled in production NameNode,
we observed severe lock contention causing all IPC Handler threads to become
BLOCKED on `MutableQuantiles.add()`, resulting in:
- *RpcQueueTimeAvgTime increases significantly* - Requests queuing in
CallQueue longer
- *CallQueueLength increases dramatically* - More requests waiting to be
processed
- *NameNode RPC throughput degradation* - Handler threads blocked on metrics
update
h3. Thread Dump Analysis
>From production NameNode jstack, we found *{*}247 IPC Handler threads
>BLOCKED{*}*
waiting on the same `MutableQuantiles` monitor lock:
{code:java}
Thread 283 (IPC Server handler 191 on default port 8020):
State: BLOCKED
Blocked count: 15596
Waited count: 28409
Blocked on org.apache.hadoop.metrics2.lib.MutableQuantiles@659a2455
Blocked by 300 (IPC Server handler 208 on default port 8020)
Stack:
org.apache.hadoop.metrics2.lib.MutableQuantiles.add(MutableQuantiles.java:133)
org.apache.hadoop.ipc.metrics.RpcMetrics.addRpcQueueTime(RpcMetrics.java:245)
org.apache.hadoop.ipc.Server.updateMetrics(Server.java:587)
org.apache.hadoop.ipc.Server$Handler.run(Server.java:3008){code}
h3. Root Cause
The `synchronized` keyword on `MutableQuantiles.add()` and
`SampleQuantiles.insert()`
creates a global lock contention point:
{code:java}
/**
* Add a new value from the stream.
*
* @param v
*/
synchronized public void insert(long v) {
buffer[bufferCount] = v;
bufferCount++;
count++;
if (bufferCount == buffer.length) {
insertBatch();
compress();
}
} {code}
Lock Contention Chain
When bufferCount == buffer.length (buffer is full), the thread holding the lock
executes insertBatch() + compress() which can take milliseconds, during
which all other Handler threads are BLOCKED:
{code:java}
Thread A (holds lock):
→ MutableQuantiles.add()
→ SampleQuantiles.insert()
→ buffer is full (500 elements)
→ insertBatch() // sorting 500 elements, holds lock
→ compress() // compressing samples, holds lock
→ Thread.sleep() in simulateProblemDelay()
Thread B, C, D, ... (247 threads):
→ MutableQuantiles.add()
→ BLOCKED waiting for Thread A to release lock
→ Cannot process next RPC request
→ CallQueueLength increases
→ RpcQueueTimeAvgTime increases
{code}
h3. Metrics
!image-2026-06-11-18-54-43-972.png!
!image-2026-06-11-18-55-17-240.png!
h3. Impact Analysis
The blocking
happens in updateMetrics() after RPC processing is complete:
{code:java}
// Server.java - Handler.run()
void updateMetrics(Call call, long startTime, boolean connDropped) {
// ...
rpcMetrics.addRpcQueueTime(queueTime); // ← BLOCKS HERE (95% of
threads)
rpcMetrics.addRpcLockWaitTime(waitTime);
rpcMetrics.addRpcProcessingTime(processingTime);
} {code}
h3. This means:
1. Handler thread completes RPC processing successfully
2. Calls updateMetrics() to report queue time
3. First call addRpcQueueTime() → BLOCKS
on MutableQuantiles lock
4. Handler thread cannot process next request in CallQueue
5. CallQueueLength increases as new requests arrive
6.
RpcQueueTimeAvgTime increases as requests wait longer in queue
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]