erzhen wu created HADOOP-19918:
----------------------------------

             Summary: MutableQuantiles.add() causes severe lock contention, 
blocking all IPC Handler threads and degrading RPC performance
                 Key: HADOOP-19918
                 URL: https://issues.apache.org/jira/browse/HADOOP-19918
             Project: Hadoop Common
          Issue Type: Bug
          Components: hadoop-common, metrics
    Affects Versions: 3.3.3
            Reporter: erzhen wu
         Attachments: image-2026-06-11-18-43-36-472.png, 
image-2026-06-11-18-54-43-972.png, image-2026-06-11-18-55-17-240.png

With `rpc.metrics.quantile.enable=true` set on a production NameNode, 
we observed severe lock contention: all IPC Handler threads became
BLOCKED on `MutableQuantiles.add()`, resulting in:
 - *RpcQueueTimeAvgTime increases significantly* - Requests wait longer in 
the CallQueue
 - *CallQueueLength increases dramatically* - More requests are waiting to be 
processed
 - *NameNode RPC throughput degrades* - Handler threads are blocked on the 
metrics update

h3. Thread Dump Analysis

From a production NameNode jstack, we found *247 IPC Handler threads BLOCKED* 
waiting on the same `MutableQuantiles` monitor lock:

 
{code:java}
Thread 283 (IPC Server handler 191 on default port 8020): 
State: BLOCKED 
Blocked count: 15596 
Waited count: 28409 
Blocked on org.apache.hadoop.metrics2.lib.MutableQuantiles@659a2455 
Blocked by 300 (IPC Server handler 208 on default port 8020) 
Stack: 
org.apache.hadoop.metrics2.lib.MutableQuantiles.add(MutableQuantiles.java:133) 
org.apache.hadoop.ipc.metrics.RpcMetrics.addRpcQueueTime(RpcMetrics.java:245) 
org.apache.hadoop.ipc.Server.updateMetrics(Server.java:587) 
org.apache.hadoop.ipc.Server$Handler.run(Server.java:3008){code}
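
To check a figure like this in a running JVM, the blocked threads can be counted through the standard `java.lang.management` API instead of by reading a dump by hand. The following is a minimal sketch, not part of Hadoop: the class `BlockedThreadCounter`, the helper `countBlockedOn()` and the stand-in `DemoLock` are hypothetical names introduced here for illustration. Against a NameNode, the fragment passed in would be `MutableQuantiles`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadCounter {

  /**
   * Counts live threads in state BLOCKED whose contended monitor name
   * (for example "org.example.Foo@659a2455") contains the given fragment.
   */
  public static int countBlockedOn(String lockClassFragment) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    int blocked = 0;
    for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
      if (info == null) {
        continue; // thread terminated while the dump was taken
      }
      String lockName = info.getLockName(); // null if not waiting on a lock
      if (info.getThreadState() == Thread.State.BLOCKED
          && lockName != null
          && lockName.contains(lockClassFragment)) {
        blocked++;
      }
    }
    return blocked;
  }

  /** Hypothetical stand-in for the contended monitor object. */
  static final class DemoLock { }

  /** Holds a lock, starts n contenders, waits until all block, counts them. */
  public static int demo(int n) {
    DemoLock lock = new DemoLock();
    Thread[] contenders = new Thread[n];
    int observed;
    synchronized (lock) {
      for (int i = 0; i < n; i++) {
        contenders[i] = new Thread(() -> {
          synchronized (lock) {
            // nothing to do; the thread only needs to contend for the monitor
          }
        });
        contenders[i].setDaemon(true);
        contenders[i].start();
      }
      // Wait until every contender is parked on the monitor we hold.
      for (Thread t : contenders) {
        while (t.getState() != Thread.State.BLOCKED) {
          Thread.yield();
        }
      }
      observed = countBlockedOn("DemoLock");
    }
    return observed;
  }

  public static void main(String[] args) {
    System.out.println("blocked on DemoLock: " + demo(3));
  }
}
```

`ThreadInfo.getBlockedCount()` and `ThreadInfo.getLockOwnerName()` expose the same kind of data as the "Blocked count" and "Blocked by" fields in the dump above.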
 
h3. Root Cause

The `synchronized` keyword on `MutableQuantiles.add()` and 
`SampleQuantiles.insert()` 
creates a global lock contention point:

 
{code:java}
/**
 * Add a new value from the stream.
 * 
 * @param v
 */
synchronized public void insert(long v) {

  buffer[bufferCount] = v;
  bufferCount++;

  count++;

  if (bufferCount == buffer.length) {
    insertBatch();
    compress();
  }
}{code}
 

h3. Lock Contention Chain

When bufferCount == buffer.length (the buffer is full), the thread holding the 
lock executes insertBatch() and compress(), which can take milliseconds. During 
that time every other Handler thread that calls add() is BLOCKED:

 
{code:java}
Thread A (holds lock): 
→ MutableQuantiles.add() 
→ SampleQuantiles.insert() 
→ buffer is full (500 elements) 
→ insertBatch() // sorting 500 elements, holds lock 
→ compress() // compressing samples, holds lock 
→ Thread.sleep() in simulateProblemDelay()
Thread B, C, D, ... (247 threads): 
→ MutableQuantiles.add() 
→ BLOCKED waiting for Thread A to release lock 
→ Cannot process next RPC request 
→ CallQueueLength increases 
→ RpcQueueTimeAvgTime increases
{code}
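
The pattern in the chain above can be reproduced outside Hadoop with a small model. This is a sketch under stated assumptions, not the Hadoop implementation: `SynchronizedBufferModel` is a hypothetical class, and `Arrays.sort()` only stands in for the real `insertBatch()` + `compress()` work. What it shows is that the flush runs inside the same monitor as every insert, so all concurrent callers serialize behind whichever thread fills the buffer.

```java
import java.util.Arrays;

public class SynchronizedBufferModel {
  private final long[] buffer;
  private int bufferCount = 0;
  private long count = 0;
  private long flushes = 0;

  public SynchronizedBufferModel(int bufferSize) {
    this.buffer = new long[bufferSize];
  }

  /** Same shape as the insert() quoted in the issue: one monitor for everything. */
  public synchronized void insert(long v) {
    buffer[bufferCount] = v;
    bufferCount++;
    count++;
    if (bufferCount == buffer.length) {
      // Stand-in for insertBatch() + compress(); runs while holding the lock,
      // so every other caller of insert() waits until it finishes.
      Arrays.sort(buffer);
      bufferCount = 0;
      flushes++;
    }
  }

  public synchronized long getCount() {
    return count;
  }

  public synchronized long getFlushes() {
    return flushes;
  }

  /** Runs the given number of threads, each performing insertsPerThread inserts. */
  public static SynchronizedBufferModel run(int threads, int insertsPerThread,
                                            int bufferSize) {
    SynchronizedBufferModel model = new SynchronizedBufferModel(bufferSize);
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < insertsPerThread; j++) {
          model.insert(j);
        }
      });
      workers[i].start();
    }
    for (Thread w : workers) {
      try {
        w.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return model;
  }

  public static void main(String[] args) {
    SynchronizedBufferModel m = run(8, 100_000, 500);
    System.out.println("inserts=" + m.getCount() + " flushes=" + m.getFlushes());
  }
}
```

Taking a thread dump while `main` runs with a larger thread count should show the workers BLOCKED on the model's monitor, in the same way as the Handler threads in the dump above.
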
h3. Metrics 

!image-2026-06-11-18-54-43-972.png!

!image-2026-06-11-18-55-17-240.png!
h3. Impact Analysis

The blocking happens in updateMetrics() after RPC processing is complete: 

 
{code:java}
// Server.java - Handler.run()
void updateMetrics(Call call, long startTime, boolean connDropped) {
  // ...
  rpcMetrics.addRpcQueueTime(queueTime);           // ← BLOCKS HERE (95% of threads)
  rpcMetrics.addRpcLockWaitTime(waitTime);
  rpcMetrics.addRpcProcessingTime(processingTime);
}
{code}
 

 
h3. This means:

  1. Handler thread completes RPC processing successfully
  2. Calls updateMetrics() to report queue time
  3. First call addRpcQueueTime() → BLOCKS on MutableQuantiles lock
  4. Handler thread cannot process next request in CallQueue
  5. CallQueueLength increases as new requests arrive
  6. RpcQueueTimeAvgTime increases as requests wait longer in queue
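
Steps 4 to 6 can be illustrated with a toy discrete-time model. It is only a sketch: `HandlerQueueModel` and all of its parameters are hypothetical, the numbers are illustrative and not measured, and the time a handler spends blocked in the metrics update is modelled as a fixed number of ticks added to each call.

```java
public class HandlerQueueModel {

  /**
   * Returns the queue length after the given number of ticks.
   * Each tick, arrivalsPerTick calls are enqueued; every idle handler then
   * takes one call and stays busy for processTicks + metricsTicks.
   */
  public static int queueLengthAfter(int ticks, int handlers, int arrivalsPerTick,
                                     int processTicks, int metricsTicks) {
    int[] busyUntil = new int[handlers]; // tick at which each handler is free
    int queued = 0;
    for (int t = 0; t < ticks; t++) {
      queued += arrivalsPerTick;
      for (int h = 0; h < handlers; h++) {
        if (busyUntil[h] <= t && queued > 0) {
          queued--;
          // The handler is unavailable for the RPC itself plus the time it
          // spends waiting in the metrics update.
          busyUntil[h] = t + processTicks + metricsTicks;
        }
      }
    }
    return queued;
  }

  public static void main(String[] args) {
    System.out.println("no metrics stall:   "
        + queueLengthAfter(12, 4, 2, 2, 0));
    System.out.println("with metrics stall: "
        + queueLengthAfter(12, 4, 2, 2, 2));
  }
}
```

With 4 handlers, 2 arrivals per tick and 2 ticks of processing, the handlers keep up exactly and the queue is empty after 12 ticks. Adding a 2-tick stall per call halves the service rate, and the queue holds 12 calls after the same 12 ticks, even though RPC processing itself is no slower.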

 

 

 


