Hmm, I am trying to figure out what I can share to reproduce this.

I will try this with a simple topology and see if this can be reproduced. I
will also try Srinath's approach of having only one worker/slot per node
and having a spare. If that works, I would have a somewhat "launchable"
scenario and I will have more time to investigate the high cpu utilization
after failover.

Thanks
Vinay


On Wed, Aug 6, 2014 at 8:24 AM, Bobby Evans <[email protected]> wrote:

>  +1 for failure testing.  We have used other similar tools in the past to
> simulate different situations like network cuts, high packet loss, etc.  I
> would love to see more of this happen, and the scheduler get smart enough
> to detect these situations and deal with them.
>
>  - Bobby
>
>   From: "P. Taylor Goetz" <[email protected]>
> Reply-To: "[email protected]" <
> [email protected]>
> Date: Tuesday, August 5, 2014 at 8:15 PM
> To: "[email protected]" <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: High CPU utilization after storm node failover
>
>   + dev@storm
>
>  Vinyasa/Srinath,
>
>  Anything you can share to make this reproducible would be very helpful.
>
>  I would love to see a network partition simulation framework for Storm
> along the lines of what Kyle Kingsbury has done with Jepsen [1]. It
> basically sets up a virtual cluster then simulates network partitions by
> manipulating iptables.
>
>  Jepsen [2] is written in clojure and Kyle is a strong proponent.
>
>  I think it is worth a look.
>
>  -Taylor
>
>  [1]
> http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
> [2] https://github.com/aphyr/jepsen
>
> On Aug 5, 2014, at 8:39 PM, Srinath C <[email protected]> wrote:
>
>   I have seen this behaviour too using 0.9.2-incubating.
> The failover works better when there is a redundant node available. Maybe
> 1 slot per node is the best approach.
>  Eager to know if there are any steps to further diagnose.
>
>
>
>
> On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]>
> wrote:
>
>>  [Storm Version: 0.9.2-incubating]
>>
>>  Hello,
>>
>>  I am trying to test failover scenarios with my storm cluster. The
>> following are the details of the cluster:
>>
>>  * 4 nodes
>> * Each node with 2 slots
>> * Topology with around 600 spouts and bolts
>> * Num. Workers for the topology = 4
>>
>>  I am running a test that generating a constant load. The cluster is
>> able to handle this load fairly well and the CPU utilization at this point
>> is below 50% on all the nodes. 1 slot is occupied on each of the nodes.
>>
>>  I then bring down one of the nodes (kill the supervisor and the worker
>> processes on a node). After this, another worker is created on one of the
>> remaining nodes. But the CPU utilization jumps up to 100%. At this point,
>> nimbus cannot communicate with the supervisor on the node and keeps killing
>> and restarting workers.
>>
>>  The CPU utilization remains pegged at 100% as long as the load is on.
>> If I stop the tests and restart the test after a while, the same set up
>> with just 3 nodes works perfectly fine with less CPU utilization.
>>
>>  Any pointers to how to figure out the reason for the high CPU
>> utilization during the failover?
>>
>>  Thanks
>>  Vinay
>>
>>
>>
>

Reply via email to