Viraj Jasani created HBASE-26012:
------------------------------------
Summary: Improve logging and dequeue logic in DelayQueue
Key: HBASE-26012
URL: https://issues.apache.org/jira/browse/HBASE-26012
Project: HBase
Issue Type: Improvement
Reporter: Viraj Jasani
Assignee: Viraj Jasani
In Remote Procedure dispatcher, before submitting (sub)Procedure to Thread
pool, we enqueue it as DelayedWithTimeout object on DelayQueue.
TimeoutExecutorThread keeps dequeuing elements from this DelayQueue and submit
the Procedure to the threadpool. The expiration of DelayedWithTimeout is
determined by getDelay(TimeUnit):
{code:java}
@Override
public long getDelay(final TimeUnit unit) {
return DelayedUtil.getRemainingTime(unit, getTimeout());
}
{code}
{code:java}
/**
* @return Time remaining as milliseconds.
*/
public static long getRemainingTime(final TimeUnit resultUnit, final long
timeout) {
final long currentTime = EnvironmentEdgeManager.currentTime();
if (currentTime >= timeout) {
return 0;
}
return resultUnit.convert(timeout - currentTime, TimeUnit.MILLISECONDS);
}
{code}
Hence, in order for the elements to get dequeued on time, it is necessary that
EnvironmentEdgeManager.currentTime() returns the current time in millis.
As part of unit test, if we use our own custom EnvironmentEdge and inject it
using EnvironmentEdgeManager.injectEdge before creating any tables, it is
possible that we continue returning same value (based on custom impl) with
EnvironmentEdgeManager.currentTime(). If that is the case, getRemainingTime as
mentioned above, will never return 0 and hence, the procedure wrapped in
DelayedWithTimeout might never be dequeued from DelayQueue because it's delay
will not expire.
As of today, our system goes in hanging state while waiting for table regions
to be available (as mentioned above, DelayedWithTimeout object never gets
dequeued from DelayQueue).
Thread dump might show something like this consistently:
{code:java}
"ProcedureDispatcherTimeoutThread" #319 daemon prio=5 os_prio=31
tid=0x00007fcaf0cae800 nid=0x21d03 waiting on condition [0x0000700019293000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007225a0090> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:223)
at
org.apache.hadoop.hbase.procedure2.util.DelayedUtil.takeWithoutInterrupt(DelayedUtil.java:82)
at
org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$TimeoutExecutorThread.run(RemoteProcedureDispatcher.java:314)
Locked ownable synchronizers:
- None
{code}
Although running into situation like this is not likely possible unless custom
EnvironmentEdge is used as mentioned above, we should improve our dequeue logic
as well as log important message to show where we are stuck.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)