[
https://issues.apache.org/jira/browse/HBASE-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell resolved HBASE-18372.
-----------------------------------------
Resolution: Cannot Reproduce
> Potential infinite busy loop in HMaster's ProcedureExecutor
> -----------------------------------------------------------
>
> Key: HBASE-18372
> URL: https://issues.apache.org/jira/browse/HBASE-18372
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 1.3.1
> Environment: Kernel 3.10.0-327.10.1.el7.x86_64
> JVM 1.8.0_102
> Reporter: Benoit Sigoure
> Priority: Major
>
> While investigating an issue today with [~timoha] we saw the HMaster
> consistently burning 1.5 cores of CPU cycles. Upon looking more closely, it
> was actually all 8 threads of {{ProcedureExecutor}} thread pool taking
> constantly ~15% of a CPU core each (I identified this by looking at
> individual threads in {{top}} and cross-referencing the thread IDs with the
> thread IDs in a JVM stack trace). The HMaster log or output didn't contain
> anything suspicious and it was hard for us to ascertain what exactly was
> happening. It just looked like these threads were regularly spinning, doing
> nothing. We just saw a lot of {{futex}} system calls happening all the time,
> and all the threads of the thread pool regularly taking turns in waking up
> and going back to sleep.
> My reading of the code in {{procedure2/ProcedureExecutor.java}} is that this
> can happen if the threads in the thread pool have been interrupted for some
> reason:
> {code}
> private void execLoop() {
> while (isRunning()) {
> Procedure proc = runnables.poll();
> if (proc == null) continue;
> {code}
> and then in {master/procedure/MasterProcedureScheduler.java}:
> {code}
> @Override
> public Procedure poll() {
> return poll(-1);
> }
> @edu.umd.cs.findbugs.annotations.SuppressWarnings("WA_AWAIT_NOT_IN_LOOP")
> Procedure poll(long waitNsec) {
> Procedure pollResult = null;
> schedLock.lock();
> try {
> if (queueSize == 0) {
> if (waitNsec < 0) {
> schedWaitCond.await();
> [...]
> } catch (InterruptedException e) {
> Thread.currentThread().interrupt();
> } finally {
> schedLock.unlock();
> }
> return pollResult;
> }
> {code}
> so my theory is the threads in the thread pool have all been interrupted
> (maybe by a procedure that ran earlier and left its thread interrupted) and
> so we are perpetually looping in {{execLoop}}, which ends up calling
> {{schedWaitCond.await();}}, which ends up throwing an
> {{InterruptedException}}, which ends up resetting the interrupt status of the
> thread, and rinse and repeat.
> But again I wasn't able to get any cold hard evidence that this is what was
> happening. There was just no other evidence that could explain this
> behavior, and I wasn't able to guess what else could be causing this that was
> consistent with what we saw and what I understood from reading the code.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)