[ 
https://issues.apache.org/jira/browse/STORM-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014430#comment-14014430
 ] 

Srinath commented on STORM-323:
-------------------------------

This issue was debugged, and the root cause was a spout blocking on a queue 
for prolonged periods of time.
The spout would only unblock itself upon some activity that occurred rarely. 
Since this was corrected, the issue has not been seen again, and heap dumps no 
longer show any sign of these tuples accumulating.
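
For illustration only, here is a minimal Java sketch of the kind of change 
involved. It assumes the spout read from a java.util.concurrent.BlockingQueue; 
the actual queue type and spout code are not shown in this issue. The class and 
method names are hypothetical, and the Storm spout interface is deliberately 
left out so the example stands alone. The idea is to replace an unbounded wait 
(such as take()) with a short bounded poll, so that the nextTuple() call 
returns promptly even when no message is available. Storm invokes nextTuple(), 
ack() and fail() on the same spout thread, so a long block in nextTuple() also 
delays everything else that thread has to do.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a spout's queue handling; it does not use the Storm API.
public class BoundedWaitSpoutSketch {
    // Assumed: messages arrive from some producer thread via this queue.
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    // Called by the (hypothetical) producer.
    public void offer(String message) {
        pending.offer(message);
    }

    // Analogue of nextTuple(): waits at most 10 ms instead of calling take(),
    // which would block until a message arrives. Returns the message that
    // would be emitted, or null when there is nothing to emit on this call.
    public String nextTuple() {
        try {
            return pending.poll(10, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        }
    }
}
```

In a real spout the non-null result would be passed to the collector's emit 
call; the 10 ms timeout is an arbitrary value chosen for the sketch.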

On a side note, the following would have helped in troubleshooting:
1. A way to trace, from the dumps, the spout/bolt whose RingBuffers held these 
unacked tuples. This wasn't possible when looking into the dumps.
2. Some documentation on internals such as async_loop, RingBuffer, etc., for 
the sake of those who are not familiar with Clojure.

> Unacknowledged __tick and __metrics_tick tuples hangs worker processes
> ----------------------------------------------------------------------
>
>                 Key: STORM-323
>                 URL: https://issues.apache.org/jira/browse/STORM-323
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.1-incubating
>         Environment: Storm:
> Nimbus, Supervisor and Zookeeper running on CentOS 6.2 over m1.small 
> instances (1.7G mem, 1 CPU, 1 core)
> Netty as the transport
> Topology:
> 2 worker processes on the same supervisor instance, each allocated 512 MB of 
> heap
> Each of the worker processes has around 30 executors running around 112 
> tasks.
>            Reporter: Srinath
>            Priority: Critical
>
> Symptoms observed:
> 1. One of the bolts stops getting executed after about 5 days of running
> 2. Spout gradually slows down and finally stops calling nextTuple()
> 3. Topology is non-functional since there is no exchange of tuples across 
> worker processes
> Notes from troubleshooting:
> 1. There is a transfer of data across worker processes but the bolt is not 
> receiving the tuples
> 2. backtype.storm.messaging.netty.Server#message_queue is not getting 
> consumed.
> 3. Later found that several __tick and __metrics_tick tuples pile up in 
> memory over a period of time. This piling up is gradual, which is probably 
> why it takes so long to cause any visible problems.
> I have shared access to thread dumps and topology layout at 
> https://drive.google.com/folderview?id=0B2F_3UACQZNESXpwZlA4MFlqSVU&usp=drive_web



--
This message was sent by Atlassian JIRA
(v6.2#6252)
