Agent deadlock possible due to blocked latch in driver thread.
--------------------------------------------------------------
Key: FLUME-768
URL: https://issues.apache.org/jira/browse/FLUME-768
Project: Flume
Issue Type: Bug
Components: Node
Affects Versions: v0.9.4
Reporter: Jonathan Hsieh
Fix For: v0.9.5
Three threads are essentially blocked. Two of the three are blocked because of
the third.
The main problem is that RollSink.close() is blocked waiting for a close to
complete. It has a subordinate thread, which seems to be gone, that normally
triggers the latch that allows it to close. My guess is that some exception made
that TriggerThread exit, and because the latch countdowns aren't present on that
path, the ok-to-shutdown latch never got cleared.
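To illustrate the guess above, here is a minimal sketch of the failure mode. It
is not the Flume RollSink code: the class LatchSketch, the method
closeCompletes, and the thread name are hypothetical, and the exception is
simulated. It shows a closer waiting on a CountDownLatch that a trigger thread
is supposed to release, and how counting down in a finally block would release
the closer even when the trigger thread fails.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the suspected failure mode; not Flume code. */
final class LatchSketch {

    /**
     * Starts a trigger thread that may fail, then waits on the latch the way a
     * close() call would. Returns true if the waiter was released in time.
     */
    static boolean closeCompletes(boolean triggerFails, boolean countDownInFinally) {
        CountDownLatch okToShutdown = new CountDownLatch(1);

        Thread trigger = new Thread(() -> {
            try {
                if (triggerFails) {
                    throw new IllegalStateException("simulated trigger failure");
                }
                if (!countDownInFinally) {
                    // Only reached on the normal path.
                    okToShutdown.countDown();
                }
            } catch (RuntimeException e) {
                // Swallowed only to keep the sketch's output quiet.
            } finally {
                if (countDownInFinally) {
                    // Reached on both the normal and the failure path.
                    okToShutdown.countDown();
                }
            }
        }, "trigger-sketch");
        trigger.start();

        try {
            // Bounded wait so the sketch itself cannot hang; an unbounded
            // await() here would park forever, as in the Thread 19 dump.
            return okToShutdown.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("countDown on normal path only, trigger fails: "
                + closeCompletes(true, false));
        System.out.println("countDown in finally, trigger fails: "
                + closeCompletes(true, true));
    }
}
```

Whether the real TriggerThread exits this way is still only a guess; the sketch
just shows why a missing countdown on an exit path would leave close() parked.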
The other two threads are blocked because of this, and likely wouldn't get stuck
here if that intermediate thread wasn't stuck.
The agent's avro source queue is full and it is blocked trying to enqueue more
data.
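For reference, the blocking seen in the Thread 21 dump is the documented
behavior of LinkedBlockingQueue.put() on a full bounded queue. The sketch below
is not the AvroEventSource code and is not proposed as the fix; the class
BoundedEnqueueSketch and the method tryEnqueue are hypothetical. It only
contrasts put() with the timed offer(), which gives up instead of parking
indefinitely when nothing drains the queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of enqueueing into a bounded queue; not Flume code. */
final class BoundedEnqueueSketch {

    /**
     * Tries to enqueue with a bounded wait. Returns false if the queue stayed
     * full for the whole timeout, where put() would have kept waiting.
     */
    static boolean tryEnqueue(BlockingQueue<String> queue, String event, long timeoutMs) {
        try {
            return queue.offer(event, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Capacity 1 so the second enqueue finds the queue full.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1);
        System.out.println("first enqueue accepted: " + tryEnqueue(queue, "e1", 100));
        System.out.println("second enqueue accepted: " + tryEnqueue(queue, "e2", 100));
        queue.poll(); // a consumer drains one element
        System.out.println("after drain accepted: " + tryEnqueue(queue, "e3", 100));
    }
}
```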
There is also another thread that is blocked: the wal draining thread is
blocked with nothing left to do (which is why everything is in the sent state).
This doesn't seem to be part of the problem.
Thread 21 (448511246@qtp-1388647956-1):
State: WAITING
Blocked count: 3
Waited count: 29
Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@11031d18
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:306)
com.cloudera.flume.handlers.avro.AvroEventSource.enqueue(AvroEventSource.java:114)
com.cloudera.flume.handlers.avro.AvroEventSource$1.append(AvroEventSource.java:135)
sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.apache.avro.specific.SpecificResponder.respond(SpecificResponder.java:93)
org.apache.avro.ipc.Responder.respond(Responder.java:136)
org.apache.avro.ipc.Responder.respond(Responder.java:88)
org.apache.avro.ipc.ResponderServlet.doPost(ResponderServlet.java:48)
javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:390)
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
org.mortbay.jetty.Server.handle(Server.java:326)
Here's another thread that is essentially blocked:
Thread 19 (logicalNode agent-19):
State: WAITING
Blocked count: 83
Waited count: 1143043
Waiting on java.util.concurrent.CountDownLatch$Sync@5c328896
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
com.cloudera.flume.handlers.rolling.RollSink.close(RollSink.java:213)
com.cloudera.flume.agent.durability.NaiveFileWALDeco.close(NaiveFileWALDeco.java:147)
com.cloudera.flume.agent.AgentSink.close(AgentSink.java:118)
com.cloudera.flume.core.EventSinkDecorator.close(EventSinkDecorator.java:67)
com.cloudera.flume.handlers.debug.LazyOpenDecorator.close(LazyOpenDecorator.java:81)
com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:121)
Here's the wal draining thread trying to pull things out of the wal.
Thread 24 (naive file wal transmit-24):
State: TIMED_WAITING
Blocked count: 156
Waited count: 171352
Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
com.cloudera.flume.agent.durability.NaiveFileWALManager.getUnackedSource(NaiveFileWALManager.java:763)
com.cloudera.flume.agent.durability.WALSource.next(WALSource.java:104)
com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:91)