Update:

I tested the scenario that Srinath proposed.

* 4 nodes (let's say A, B, C, D), each with a single slot/worker (see the
storm.yaml sketch below).
* The topology was started with 3 workers (on A, B, C).
* 1 node (D) was kept spare.
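
For reference, "a single slot/worker per node" just means supervisor.slots.ports
in storm.yaml trimmed to one port on each supervisor. A minimal sketch, assuming
the default port number:

    supervisor.slots.ports:
        - 6700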

Once the load had stabilized, I killed one of the nodes on which the
topology was running (node A). The following things happened:

* A new worker was created on the previously spare node D.
* The events from the load test continued to be processed.
* But CPU utilization on nodes B and C (from the original set) shot up to
100%.
* Interestingly, it was the "system cpu time" that went up, as opposed to
"user cpu time" (a quick way to check that split is noted right after this
list).
* CPU stayed at 100% for a few minutes, and then the workers on nodes B and C
died.
* Both B and C logged the exception below when their workers died. It looks
like they were still trying to contact node A, the one that was brought down
(a note on the relevant netty client settings follows the stack trace).
* The supervisor restarted the workers, and once the new workers were up, the
CPU utilization dropped back to normal.
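
(In case it helps anyone reproduce this: the user/system split can be checked
per worker process with sysstat's pidstat, e.g.

    pidstat -u -p <worker-pid> 5

which reports %usr and %system for the given pid; <worker-pid> is just a
placeholder for the worker JVM's pid.)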

Any ideas as to what might be happening?

Thanks
Vinay

ERROR 2014-08-06 18:59:45,373 [b.s.util Thread-427-disruptor-worker-transfer-queue] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Remote address is not reachable. We will close this client Netty-Client-NODE-A/NODE-A-IP:6700
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
Caused by: java.lang.RuntimeException: Remote address is not reachable. We will close this client Netty-Client-NODE-A/NODE-A-IP:6700
    at backtype.storm.messaging.netty.Client.connect(Client.java:166) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.messaging.netty.Client.send(Client.java:203) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
    ... 6 common frames omitted
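
For context on the trace: the caused-by is the netty client on B/C giving up on
node A, which looks like the client exhausting its reconnect attempts. The
retry/backoff knobs for the netty transport live in storm.yaml; a sketch with
what I believe are the 0.9.2-incubating defaults (worth double-checking against
defaults.yaml):

    storm.messaging.netty.max_retries: 30
    storm.messaging.netty.min_wait_ms: 100
    storm.messaging.netty.max_wait_ms: 1000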

On Wed, Aug 6, 2014 at 9:02 AM, Vinay Pothnis <[email protected]>
wrote:

> Hmm, I am trying to figure out what I can share to reproduce this.
>
> I will try this with a simple topology and see if this can be reproduced.
> I will also try Srinath's approach of having only one worker/slot per node
> and having a spare. If that works, I would have a somewhat "launchable"
> scenario and I will have more time to investigate the high cpu utilization
> after failover.
>
> Thanks
> Vinay
>
>
> On Wed, Aug 6, 2014 at 8:24 AM, Bobby Evans <[email protected]> wrote:
>
>>  +1 for failure testing.  We have used other similar tools in the past
>> to simulate different situations like network cuts, high packet loss, etc.
>>  I would love to see more of this happen, and the scheduler get smart
>> enough to detect these situations and deal with them.
>>
>>  - Bobby
>>
>>   From: "P. Taylor Goetz" <[email protected]>
>> Reply-To: "[email protected]" <
>> [email protected]>
>> Date: Tuesday, August 5, 2014 at 8:15 PM
>> To: "[email protected]" <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: High CPU utilization after storm node failover
>>
>>   + dev@storm
>>
>>  Vinay/Srinath,
>>
>>  Anything you can share to make this reproducible would be very helpful.
>>
>>  I would love to see a network partition simulation framework for Storm
>> along the lines of what Kyle Kingsbury has done with Jepsen [1]. It
>> basically sets up a virtual cluster then simulates network partitions by
>> manipulating iptables.
>>
>>  Jepsen [2] is written in Clojure, and Kyle is a strong proponent.
>>
>>  I think it is worth a look.
>>
>>  -Taylor
>>
>>  [1]
>> http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
>> [2] https://github.com/aphyr/jepsen
>>
>> On Aug 5, 2014, at 8:39 PM, Srinath C <[email protected]> wrote:
>>
>>   I have seen this behaviour too using 0.9.2-incubating.
>> The failover works better when there is a redundant node available. Maybe
>> 1 slot per node is the best approach.
>>  Eager to know if there are any steps to further diagnose.
>>
>>
>>
>>
>> On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]>
>> wrote:
>>
>>>  [Storm Version: 0.9.2-incubating]
>>>
>>>  Hello,
>>>
>>>  I am trying to test failover scenarios with my storm cluster. The
>>> following are the details of the cluster:
>>>
>>>  * 4 nodes
>>> * Each node with 2 slots
>>> * Topology with around 600 spouts and bolts
>>> * Num. Workers for the topology = 4
>>>
>>>  I am running a test that generates a constant load. The cluster is
>>> able to handle this load fairly well and the CPU utilization at this point
>>> is below 50% on all the nodes. 1 slot is occupied on each of the nodes.
>>>
>>>  I then bring down one of the nodes (kill the supervisor and the worker
>>> processes on a node). After this, another worker is created on one of the
>>> remaining nodes. But the CPU utilization jumps up to 100%. At this point,
>>> nimbus cannot communicate with the supervisor on the node and keeps killing
>>> and restarting workers.
>>>
>>>  The CPU utilization remains pegged at 100% as long as the load is on.
>>> If I stop the tests and restart the test after a while, the same set up
>>> with just 3 nodes works perfectly fine with less CPU utilization.
>>>
>>>  Any pointers on how to figure out the reason for the high CPU
>>> utilization during the failover?
>>>
>>>  Thanks
>>>  Vinay
>>>
>>>
>>>
>>
>
