Viraj Jasani created HBASE-26012:
------------------------------------

             Summary: Improve logging and dequeue logic in DelayQueue
                 Key: HBASE-26012
                 URL: https://issues.apache.org/jira/browse/HBASE-26012
             Project: HBase
          Issue Type: Improvement
            Reporter: Viraj Jasani
            Assignee: Viraj Jasani


In Remote Procedure dispatcher, before submitting (sub)Procedure to Thread 
pool, we enqueue it as DelayedWithTimeout object on DelayQueue. 

TimeoutExecutorThread keeps dequeuing elements from this DelayQueue and submit 
the Procedure to the threadpool. The expiration of DelayedWithTimeout is 
determined by getDelay(TimeUnit):

 
{code:java}
@Override
public long getDelay(final TimeUnit unit) {
  return DelayedUtil.getRemainingTime(unit, getTimeout());
}
{code}
{code:java}
  /**
   * @return Time remaining as milliseconds.
   */
  public static long getRemainingTime(final TimeUnit resultUnit, final long 
timeout) {
    final long currentTime = EnvironmentEdgeManager.currentTime();
    if (currentTime >= timeout) {
      return 0;
    }
    return resultUnit.convert(timeout - currentTime, TimeUnit.MILLISECONDS);
  }
{code}
Hence, in order for the elements to get dequeued on time, it is necessary that 
EnvironmentEdgeManager.currentTime() returns the current time in millis.

As part of unit test, if we use our own custom EnvironmentEdge and inject it 
using EnvironmentEdgeManager.injectEdge before creating any tables, it is 
possible that we continue returning same value (based on custom impl) with 
EnvironmentEdgeManager.currentTime(). If that is the case, getRemainingTime as 
mentioned above, will never return 0 and hence, the procedure wrapped in 
DelayedWithTimeout might never be dequeued from DelayQueue because it's delay 
will not expire.

As of today, our system goes in hanging state while waiting for table regions 
to be available (as mentioned above, DelayedWithTimeout object never gets 
dequeued from DelayQueue).

Thread dump might show something like this consistently:
{code:java}
"ProcedureDispatcherTimeoutThread" #319 daemon prio=5 os_prio=31 
tid=0x00007fcaf0cae800 nid=0x21d03 waiting on condition [0x0000700019293000]
   java.lang.Thread.State: TIMED_WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for  <0x00000007225a0090> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
  at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
  at java.util.concurrent.DelayQueue.take(DelayQueue.java:223)
  at 
org.apache.hadoop.hbase.procedure2.util.DelayedUtil.takeWithoutInterrupt(DelayedUtil.java:82)
  at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$TimeoutExecutorThread.run(RemoteProcedureDispatcher.java:314)
   Locked ownable synchronizers:
  - None
{code}
Although running into situation like this is not likely possible unless custom 
EnvironmentEdge is used as mentioned above, we should improve our dequeue logic 
as well as log important message to show where we are stuck.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to