Hmm, I am trying to figure out what I can share to reproduce this. I will try this with a simple topology and see if this can be reproduced. I will also try Srinath's approach of having only one worker/slot per node and having a spare. If that works, I would have a somewhat "launchable" scenario and I will have more time to investigate the high cpu utilization after failover.
Thanks Vinay On Wed, Aug 6, 2014 at 8:24 AM, Bobby Evans <[email protected]> wrote: > +1 for failure testing. We have used other similar tools in the past to > simulate different situations like network cuts, high packet loss, etc. I > would love to see more of this happen, and the scheduler get smart enough > to detect these situations and deal with them. > > - Bobby > > From: "P. Taylor Goetz" <[email protected]> > Reply-To: "[email protected]" < > [email protected]> > Date: Tuesday, August 5, 2014 at 8:15 PM > To: "[email protected]" <[email protected]> > Cc: "[email protected]" <[email protected]> > Subject: Re: High CPU utilization after storm node failover > > + dev@storm > > Vinyasa/Srinath, > > Anything you can share to make this reproducible would be very helpful. > > I would love to see a network partition simulation framework for Storm > along the lines of what Kyle Kingsbury has done with Jepsen [1]. It > basically sets up a virtual cluster then simulates network partitions by > manipulating iptables. > > Jepsen [2] is written in clojure and Kyle is a strong proponent. > > I think it is worth a look. > > -Taylor > > [1] > http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions > [2] https://github.com/aphyr/jepsen > > On Aug 5, 2014, at 8:39 PM, Srinath C <[email protected]> wrote: > > I have seen this behaviour too using 0.9.2-incubating. > The failover works better when there is a redundant node available. Maybe > 1 slot per node is the best approach. > Eager to know if there are any steps to further diagnose. > > > > > On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]> > wrote: > >> [Storm Version: 0.9.2-incubating] >> >> Hello, >> >> I am trying to test failover scenarios with my storm cluster. The >> following are the details of the cluster: >> >> * 4 nodes >> * Each node with 2 slots >> * Topology with around 600 spouts and bolts >> * Num. Workers for the topology = 4 >> >> I am running a test that generating a constant load. The cluster is >> able to handle this load fairly well and the CPU utilization at this point >> is below 50% on all the nodes. 1 slot is occupied on each of the nodes. >> >> I then bring down one of the nodes (kill the supervisor and the worker >> processes on a node). After this, another worker is created on one of the >> remaining nodes. But the CPU utilization jumps up to 100%. At this point, >> nimbus cannot communicate with the supervisor on the node and keeps killing >> and restarting workers. >> >> The CPU utilization remains pegged at 100% as long as the load is on. >> If I stop the tests and restart the test after a while, the same set up >> with just 3 nodes works perfectly fine with less CPU utilization. >> >> Any pointers to how to figure out the reason for the high CPU >> utilization during the failover? >> >> Thanks >> Vinay >> >> >> >
