[ https://issues.apache.org/jira/browse/HBASE-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Kyle Purtell resolved HBASE-18372.
-----------------------------------------
    Resolution: Cannot Reproduce

> Potential infinite busy loop in HMaster's ProcedureExecutor
> -----------------------------------------------------------
>
>                 Key: HBASE-18372
>                 URL: https://issues.apache.org/jira/browse/HBASE-18372
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.3.1
>         Environment: Kernel 3.10.0-327.10.1.el7.x86_64
> JVM 1.8.0_102
>            Reporter: Benoit Sigoure
>            Priority: Major
>
> While investigating an issue today with [~timoha] we saw the HMaster 
> consistently burning 1.5 cores of CPU cycles.  Upon looking more closely, it 
> was actually all 8 threads of the {{ProcedureExecutor}} thread pool, each 
> constantly taking ~15% of a CPU core (I identified this by looking at 
> individual threads in {{top}} and cross-referencing the thread IDs with the 
> thread IDs in a JVM stack trace).  The HMaster log or output didn't contain 
> anything suspicious and it was hard for us to ascertain what exactly was 
> happening.  It just looked like these threads were regularly spinning, doing 
> nothing.  We just saw a lot of {{futex}} system calls happening all the time, 
> and all the threads of the thread pool regularly taking turns in waking up 
> and going back to sleep.
> My reading of the code in {{procedure2/ProcedureExecutor.java}} is that this 
> can happen if the threads in the thread pool have been interrupted for some 
> reason:
> {code}
>   private void execLoop() {
>     while (isRunning()) {
>       Procedure proc = runnables.poll();
>       if (proc == null) continue;
> {code}
> and then in {{master/procedure/MasterProcedureScheduler.java}}:
> {code}
>   @Override
>   public Procedure poll() {
>     return poll(-1);
>   }
>   @edu.umd.cs.findbugs.annotations.SuppressWarnings("WA_AWAIT_NOT_IN_LOOP")
>   Procedure poll(long waitNsec) {
>     Procedure pollResult = null;
>     schedLock.lock();
>     try {
>       if (queueSize == 0) {
>         if (waitNsec < 0) {
>           schedWaitCond.await();
> [...]
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>     } finally {
>       schedLock.unlock();
>     }
>     return pollResult;
>   }
> {code}
> so my theory is that the threads in the thread pool have all been interrupted 
> (maybe by a procedure that ran earlier and left its thread interrupted), so 
> we loop perpetually in {{execLoop}}: it calls {{schedWaitCond.await();}}, 
> which throws an {{InterruptedException}} immediately because the interrupt 
> flag is set, the {{catch}} block sets the interrupt status of the thread 
> again, {{poll()}} returns {{null}}, and rinse and repeat.
> But again I wasn't able to get any cold hard evidence that this is what was 
> happening.  There was just no other evidence that could explain this 
> behavior, and I wasn't able to guess what else could be causing this that was 
> consistent with what we saw and what I understood from reading the code.
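
The mechanism proposed in the description can be reproduced in isolation. The following is a minimal sketch, not HBase code: the class {{InterruptSpinDemo}}, the method {{countSpins}}, and the standalone lock and condition are hypothetical stand-ins for {{schedLock}} and {{schedWaitCond}}. It assumes only the pattern quoted above (catch {{InterruptedException}}, restore the interrupt flag, return {{null}}) and shows that a thread whose interrupt flag is already set never blocks in {{await()}}.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class InterruptSpinDemo {
    // Hypothetical stand-ins for schedLock / schedWaitCond in the quoted code.
    private static final ReentrantLock lock = new ReentrantLock();
    private static final Condition cond = lock.newCondition();

    // Same shape as the quoted poll(): on interrupt, restore the flag and return null.
    static Object poll() {
        lock.lock();
        try {
            // Nobody ever signals this condition. With the interrupt flag set,
            // await() throws InterruptedException on entry instead of blocking.
            cond.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // flag is set again for the next call
        } finally {
            lock.unlock();
        }
        return null;
    }

    // Runs a bounded version of execLoop() and counts iterations that got null.
    static long countSpins(int maxIterations) throws InterruptedException {
        AtomicLong spins = new AtomicLong();
        Thread worker = new Thread(() -> {
            Thread.currentThread().interrupt(); // simulate a leftover interrupt
            for (int i = 0; i < maxIterations; i++) {
                Object proc = poll();
                if (proc == null) {
                    spins.incrementAndGet();
                    continue;
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
        worker.join(10_000); // would time out if await() actually blocked
        return spins.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countSpins(1000)); // prints 1000
    }
}
```

If the worker did not start with its interrupt flag set, the first {{await()}} would block indefinitely, since the condition is never signalled. Clearing the flag in the {{catch}} block instead of restoring it, or checking it in the loop condition, would break the cycle in this sketch; whether that is appropriate for {{ProcedureExecutor}} is a separate question.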



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
