Attaching the topology description.
On Wed, May 14, 2014 at 9:40 AM, Srinath C <[email protected]> wrote:

> Hi Sean,
> Thanks for the inputs. I'm attaching the full output of jstack of the two worker processes.
> Note that my-bolt is actually named metrics-observer in these stack traces.
> Hopefully this can help in figuring out who is blocking the delivery of messages.
>
> I even have heap dumps if that can help, but I can only share access to them privately.
> I'll also work on masking out proprietary details from the topology configuration and topology layout so that I can share them.
>
> Thanks,
> Srinath.
>
>
> On Wed, May 14, 2014 at 8:35 AM, Sean Zhong <[email protected]> wrote:
>
>> Hi, Srinath,
>>
>> From your description, it seems the receiver thread is blocked by other components (not my-bolt).
>>
>> The receiver works like this: in a loop, fetch data from the Server receiver queue -> deliver to the task input queues one by one in a blocking way.
>>
>> From the stack trace, the my-bolt input disruptor queue is empty, so it cannot be blocked by my-bolt. If the Netty Server receiver queue really has incoming messages, then it is possible the receiver thread is blocked when trying to deliver a message to other tasks in the worker process.
>>
>> Here are the suggestions:
>> 1. Share with us what the topology looks like.
>> 2. Post your topology configuration.
>> 3. Post the full stack trace of the worker process.
>>
>> Sean
>>
>>
>> On Wed, May 14, 2014 at 8:35 AM, Srinath C <[email protected]> wrote:
>>
>> > Hi,
>> > Forwarding this issue to the dev forum since I thought it would be more relevant to discuss it here.
>> >
>> > Thanks,
>> > Srinath.
>> >
>> >
>> > ---------- Forwarded message ----------
>> > From: Srinath C <[email protected]>
>> > Date: Tue, May 13, 2014 at 8:17 AM
>> > Subject: Re: Peeking into storm's internal buffers
>> > To: [email protected]
>> >
>> >
>> > Here is more info...
>> > My suspicion is that the queue backtype.storm.messaging.netty.Server#message_queue is not getting consumed.
>> > A heap dump of the worker process reveals that the size of that queue is around 40k.
>> >
>> > I have snippets of the worker log that I can share right now. Some thread stacks are also attached.
>> >
>> > my-spout is the name of the spout that emits a tuple which is transferred to all the tasks of the bolt my-bolt.
>> > my-bolt has 10 tasks numbered [43 52]. worker-2 has tasks 43, 45, 47, 49 and 51, while worker-1 has tasks 44, 46, 48, 50 and 52.
>> >
>> > The logs below are related to the exchange between worker-1 and worker-2 when my-spout emits a tuple.
>> >
>> > worker-1
>> > **********
>> > 2014-05-12 13:04:18 n.a.my-spout[111] [DEBUG] nextTuple : emitting tuple [1399914258343]
>> > 2014-05-12 13:04:18 n.a.my-bolt[*50*] [DEBUG] execute : received a status request with id 1399914258343
>> > 2014-05-12 13:04:18 n.a.my-bolt[*46*] [DEBUG] execute : received a status request with id 1399914258343
>> > 2014-05-12 13:04:18 n.a.my-bolt[*48*] [DEBUG] execute : received a status request with id 1399914258343
>> > 2014-05-12 13:04:18 b.s.m.n.StormClientHandler [DEBUG] 5 request(s) sent
>> > 2014-05-12 13:04:18 n.a.my-bolt[*52*] [DEBUG] execute : received a status request with id 1399914258343
>> > 2014-05-12 13:04:18 n.a.my-bolt[*44*] [DEBUG] execute : received a status request with id 1399914258343
>> > 2014-05-12 13:04:18 b.s.m.n.StormClientHandler [DEBUG] send/recv time (ms): 227523778
>> > 2014-05-12 13:04:18 b.s.d.worker [DEBUG] Doing heartbeat #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1399914258, :storm-id "metrics-topology-1-1399685003", :executors #{[2 2] [66 66] [68 68] [38 38] [70 70] [40 40] [72 72] [42 42] [74 74] [44 44] [76 76] [46 46] [78 78] [48 48] [80 80] [50 50] [82 82] [52 52] [84 84] [86 86] [94 94] [-1 -1] [64 64] [96 96] [98 102] [8 12] [108 112] [18 22] [88 92] [58 62] [28 32]}, :port 6700}
>> >
>> > worker-2
>> > ***********
>> > 2014-05-12 13:04:17 b.s.d.worker [DEBUG] Doing heartbeat #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1399914257, :storm-id "metrics-topology-1-1399685003", :executors #{[67 67] [69 69] [39 39] [71 71] [41 41] [73 73] [43 43] [75 75] [45 45] [77 77] [47 47] [79 79] [49 49] [81 81] [51 51] [83 83] [85 85] [87 87] [93 93] [63 63] [95 95] [-1 -1] [1 1] [65 65] [97 97] [3 7] [103 107] [13 17] [53 57] [23 27] [33 37]}, :port 6701}
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *43*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *45*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *47*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *49*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *51*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.StormServerHandler [DEBUG] Send back response ...
>> > 2014-05-12 13:04:18 b.s.d.worker [DEBUG] Doing heartbeat #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1399914258, :storm-id "metrics-topology-1-1399685003", :executors #{[67 67] [69 69] [39 39] [71 71] [41 41] [73 73] [43 43] [75 75] [45 45] [77 77] [47 47] [79 79] [49 49] [81 81] [51 51] [83 83] [85 85] [87 87] [93 93] [63 63] [95 95] [-1 -1] [1 1] [65 65] [97 97] [3 7] [103 107] [13 17] [53 57] [23 27] [33 37]}, :port 6701}
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *43*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *45*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *47*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *49*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: *51*, payload size: 11
>> > 2014-05-12 13:04:18 b.s.m.n.StormServerHandler [DEBUG] Send back response ...
>> > 2014-05-12 13:04:18 b.s.m.n.Server [DEBUG] message received with task: 24, payload size: 1225
>> > 2014-05-12 13:04:18 b.s.m.n.StormServerHandler [DEBUG] Send back response ...
>> >
>> > This confirms that worker-2 received tuples from worker-1, but the bolt tasks are not getting executed.
>> > In a normal system, worker-2 would have shown additional logs like:
>> >
>> > 2014-05-12 19:42:33 b.s.m.n.Server [DEBUG] request to be processed: backtype.storm.messaging.TaskMessage@65c5e8d8
>> > 2014-05-12 19:42:33 b.s.m.n.Server [DEBUG] request to be processed: backtype.storm.messaging.TaskMessage@2439f11b
>> >
>> > which are logged when an executor consumes from that queue.
>> >
>> >
>> > On Mon, May 12, 2014 at 8:29 PM, Srinath C <[email protected]> wrote:
>> >
>> >> Hi Padma,
>> >> See inline for my replies....
>> >>
>> >>
>> >> On Mon, May 12, 2014 at 4:39 PM, padma priya chitturi <[email protected]> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> A few questions on your issue:
>> >>>
>> >>> 1. As soon as you start the topology, does bolt execution never start at all, or does it get stuck after processing a few tuples? Can you give a clearer picture on this?
>> >>
>> >> <Srinath> Bolt execution is completely normal in the beginning and for a long time (about 1-2 days), and then it abruptly stops.
>> >>
>> >>> 2. You said the behavior is seen on one of the worker processes. Is the bolt executed without any issues on the other worker process?
>> >>
>> >> <Srinath> Yes, I see this only on one process. The bolt tasks in the other process are working fine.
>> >>
>> >>> 3. What is the source from which the spout reads the input?
>> >>
>> >> <Srinath> The spout is reading from a kestrel queue.
>> >>
>> >>> Also, it would be great if you could provide nimbus/supervisor and worker logs.
>> >>
>> >> <Srinath> I'm currently troubleshooting this issue and looking into the worker logs. I will soon share some more details on my findings.
>> >> I don't see much in the nimbus and supervisor logs, but the worker log details seem to reveal some clues. Will update shortly.
>> >>
>> >>> On Mon, May 12, 2014 at 6:57 AM, Srinath C <[email protected]> wrote:
>> >>>
>> >>>> Hi,
>> >>>> I'm facing a strange issue running a topology on version 0.9.1-incubating with Netty as transport.
>> >>>> The topology has two worker processes on the same worker machine.
>> >>>>
>> >>>> To summarize the behavior, on one of the worker processes:
>> >>>>    - one of the bolts is not getting executed: the bolt has multiple executors, but none of them are executed
>> >>>>    - the spouts in the same worker process are trying to emit tuples to the bolt, but the bolt is still not executed
>> >>>>    - after a while the spout itself is no longer executed (nextTuple is not called)
>> >>>>
>> >>>> My suspicion is that some internal buffers are getting filled up, the topology.max.spout.pending limit is hit, and Storm no longer invokes the spout. The topology remains hung like this for a long time, probably forever.
>> >>>> From the jstack output, I could figure out that the affected process had 5 fewer threads than a normal process. Those threads had a stack like the one below:
>> >>>>
>> >>>> Thread 5986: (state = BLOCKED)
>> >>>>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>> >>>>  - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame)
>> >>>>  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=68, line=2082 (Compiled frame)
>> >>>>  - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take() @bci=122, line=1090 (Compiled frame)
>> >>>>  - java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take() @bci=1, line=807 (Interpreted frame)
>> >>>>  - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 (Interpreted frame)
>> >>>>  - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Interpreted frame)
>> >>>>  - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
>> >>>>  - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
>> >>>>
>> >>>> Has anyone seen such an issue? Any idea how I can confirm my suspicion that internal buffers are getting filled up? What else can I collect from the processes for troubleshooting?
>> >>>>
>> >>>> Thanks,
>> >>>> Srinath.
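
To make sure I'm reading Sean's explanation correctly, the delivery path he describes above -- one receiver thread pulling messages off the Netty Server's receive queue and handing each one to the destination task's input queue in a blocking way -- is roughly the loop sketched below. This is only my mental model, with made-up names (serverQueue, taskQueues, TaskMessageStub) and plain BlockingQueues; it is not the actual Storm 0.9.1 code, which uses disruptor-backed task queues. But it shows why a single full task input queue could stall delivery to every task in the worker:

import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only -- all names here are invented for the example.
class ReceiverLoopSketch implements Runnable {

    // Minimal stand-in for backtype.storm.messaging.TaskMessage
    static class TaskMessageStub {
        final int task;
        final byte[] payload;
        TaskMessageStub(int task, byte[] payload) { this.task = task; this.payload = payload; }
    }

    private final BlockingQueue<TaskMessageStub> serverQueue;              // Netty Server receive queue
    private final Map<Integer, BlockingQueue<TaskMessageStub>> taskQueues; // per-task input queues

    ReceiverLoopSketch(BlockingQueue<TaskMessageStub> serverQueue,
                       Map<Integer, BlockingQueue<TaskMessageStub>> taskQueues) {
        this.serverQueue = serverQueue;
        this.taskQueues = taskQueues;
    }

    public void run() {
        try {
            while (true) {
                // blocks until the Netty server has queued an incoming message
                TaskMessageStub msg = serverQueue.take();
                // blocks if the destination task's input queue is full -- so one slow or
                // stuck task could stall delivery to every task in this worker process
                taskQueues.get(msg.task).put(msg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

If delivery stalls like that, messages would keep piling up in Server#message_queue (which would explain the ~40k entries seen in the heap dump), and once the spout's in-flight count reaches topology.max.spout.pending, nextTuple() is no longer called -- which would match the behavior I'm seeing.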
MetricsTopology

Spouts (tasks, executors, description)
******
pr-metrics-stream (7, 7, read from kestrel queue)
server-metrics-stream (5, 5, read from kestrel queue)
service-metrics-stream (5, 5, read from kestrel queue)
cloud-watch-cluster-sync (5, 1, read from RabbitMQ queue)
metrics-observer-cluster-sync (5, 1, read from RabbitMQ queue)
metrics-observer-service-sync (5, 1, read from RabbitMQ queue)
service-status-ticker (5, 1, clock tick every 30 seconds)
cluster-status-ticker (5, 1, clock tick every 30 seconds)

Bolts (tasks, executors, description)
******
cluster-event-dispatcher (5, 1, receive from cloud-watch-cluster-sync and emit to cloud-watch-poller, receive from metrics-observer-cluster-sync and emit to metrics-observer)
service-event-dispatcher (5, 1, receive from metrics-observer-service-sync and emit to metrics-observer)
cloud-watch-poller (5, 1, receive from cloud-watch-cluster-sync and emit to metrics-observer)
metrics-observer (10, 10, receive from pr-metrics-stream, server-metrics-stream, service-metrics-stream, service-event-dispatcher, cluster-event-dispatcher and cloud-watch-poller and emit to event-store and event-notifier)
service-status-aggregator (5, 1, receive tuples from metrics-observer and emit to service-status-notifier)
service-status-notifier (5, 1, receive from service-status-aggregator and write to RabbitMQ exchange)
cluster-status-aggregator (5, 1, receive tuples from metrics-observer and cloud-watch-poller and emit to cluster-status-notifier)
cluster-status-notifier (5, 1, receive from cluster-status-aggregator and write to RabbitMQ exchange)
metrics-store (3, 3, write to elasticsearch)
pep-metrics-store (7, 7, write to elasticsearch)
event-notifier (5, 1, write to RabbitMQ exchange)
event-store (5, 5, write to elasticsearch)
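
While I mask out the proprietary parts of the real code, here is a rough outline of how a layout like the one above gets wired up with Storm's TopologyBuilder. Only the component names and the tasks/executors counts come from the description; the spout/bolt classes are placeholders, and the groupings and max-spout-pending value are illustrative guesses, not our actual topology code:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class MetricsTopologySketch {

    // Placeholder spout standing in for the kestrel/RabbitMQ-backed spouts above.
    public static class MetricStreamSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(1000);  // the real spout would poll its queue here
            long id = System.currentTimeMillis();
            collector.emit(new Values(id), id);  // anchored, so max.spout.pending applies
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("metric"));
        }
    }

    // Placeholder bolt standing in for metrics-observer (my-bolt in the thread above).
    public static class MetricsObserverBolt extends BaseRichBolt {
        private OutputCollector collector;
        public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
        }
        public void execute(Tuple input) {
            collector.ack(input);
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // parallelism hint = executors, setNumTasks = tasks (counts from the description above)
        builder.setSpout("pr-metrics-stream", new MetricStreamSpout(), 7).setNumTasks(7);
        builder.setSpout("metrics-observer-cluster-sync", new MetricStreamSpout(), 1).setNumTasks(5);

        builder.setBolt("metrics-observer", new MetricsObserverBolt(), 10).setNumTasks(10)
               .shuffleGrouping("pr-metrics-stream")            // grouping is a guess
               .allGrouping("metrics-observer-cluster-sync");   // guess -- the thread mentions a tuple
                                                                // going to all metrics-observer tasks

        Config conf = new Config();
        conf.setNumWorkers(2);
        // Once this many tuples are pending per spout task, nextTuple() stops being called --
        // which is why a stalled downstream queue can silently halt the spout.
        conf.setMaxSpoutPending(1000);  // illustrative value

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("metrics-topology-sketch", conf, builder.createTopology());
    }
}

Note that the parallelism hint passed to setSpout/setBolt is the executor count, while setNumTasks fixes the task count -- which is why several components above list 5 tasks but only 1 executor.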
