[
https://issues.apache.org/jira/browse/STORM-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984901#comment-13984901
]
Sushant Kumar commented on STORM-299:
-------------------------------------
Thanks [~revans2] to discuss this with me.
The root cause was a spout that blocks in the nextTuple method for a long time
(hours) listening on a socket. Note that the spout doesn't emit anything in
that time frame and therefore doesn't have any pending ack or fail to be
called; topology doesn't use anchoring as well.
Apparently spout's receive queue is used by storm for metrics and system
stream. Spout's thread (the one that also calls nextTuple) pulls these ticks
out from that queue. The metrics and system stream are driven by a global timer
thread, which gets blocked on trying to publish a new tuple to the spout's
receive queue. The blocked global timer thread causes a complete halt of tick
tuples to other bolt instances.
It will be good to document that a spout should NEVER sleep more than few
seconds to keep the system running.
> tick tuples stop in a long running topology
> -------------------------------------------
>
> Key: STORM-299
> URL: https://issues.apache.org/jira/browse/STORM-299
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Affects Versions: 0.9.0.1
> Reporter: Sushant Kumar
>
> We are running a topology that processes about 10K tuples a second on a
> single worker. There are about 80 executor threads across 3 different Bolt
> types, each receiving a tick tuple at a frequency of 1 second. We observe
> that the tick tuples stop coming in after few hours or sometime even after a
> day. The nimbus UI also shows that the tick tuples count has frozen whereas
> the same set of bolts continue to process tuples from other spouts. I have
> taken a stack trace and it shows following:
> "Thread-8" daemon prio=10 tid=0x00007f1a0fed2000 nid=0x4438 runnable
> [0x00007f19a5e01000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
> at
> com.lmax.disruptor.AbstractMultithreadedClaimStrategy.waitForFreeSlotAt(AbstractMultithreadedClaimStrategy.java:99)
> at
> com.lmax.disruptor.AbstractMultithreadedClaimStrategy.incrementAndGet(AbstractMultithreadedClaimStrategy.java:49)
> at com.lmax.disruptor.Sequencer.next(Sequencer.java:127)
> at
> backtype.storm.utils.DisruptorQueue.publish(DisruptorQueue.java:113)
> at backtype.storm.disruptor$publish.invoke(disruptor.clj:51)
> at backtype.storm.disruptor$publish.invoke(disruptor.clj:53)
> at
> backtype.storm.daemon.executor$setup_ticks_BANG_$fn__3885.invoke(executor.clj:299)
> at
> backtype.storm.timer$schedule_recurring$this__1839.invoke(timer.clj:79)
> at
> backtype.storm.timer$mk_timer$thread_fn__1822$fn__1823.invoke(timer.clj:32)
> at backtype.storm.timer$mk_timer$thread_fn__1822.invoke(timer.clj:25)
> at clojure.lang.AFn.run(AFn.java:24)
> at java.lang.Thread.run(Thread.java:744)
>
> Locked ownable synchronizers:
> - None
>
> "Thread-268-__system" prio=10 tid=0x00007f1a0ef45000 nid=0x46b5 runnable
> [0x00007f17ef574000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x000000073d9a2060> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2176)
> at
> com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:87)
> at
> com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:54)
> at
> backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:56)
> at
> backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62)
> at
> backtype.storm.daemon.executor$eval4023$fn__4024$fn__4038$fn__4089.invoke(executor.clj:749)
> at backtype.storm.util$async_loop$fn__364.invoke(util.clj:400)
> at clojure.lang.AFn.run(AFn.java:24)
> at java.lang.Thread.run(Thread.java:744)
>
> Locked ownable synchronizers:
> - None
> We are running storm-0.9.0_wip21 and I would be glad to provide any other
> information.
> Also see https://issues.apache.org/jira/browse/STORM-106.
> Thanks,
> Sushant
--
This message was sent by Atlassian JIRA
(v6.2#6252)