+1 for failure testing. We have used similar tools in the past to simulate situations like network cuts, high packet loss, etc. I would love to see more of this happen, and to see the scheduler get smart enough to detect these situations and deal with them.
- Bobby

From: "P. Taylor Goetz" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 5, 2014 at 8:15 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: High CPU utilization after storm node failover

+ dev@storm

Vinay/Srinath,

Anything you can share to make this reproducible would be very helpful.

I would love to see a network partition simulation framework for Storm along the lines of what Kyle Kingsbury has done with Jepsen [1]. It basically sets up a virtual cluster and then simulates network partitions by manipulating iptables. Jepsen [2] is written in Clojure, of which Kyle is a strong proponent. I think it is worth a look.

-Taylor

[1] http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
[2] https://github.com/aphyr/jepsen

On Aug 5, 2014, at 8:39 PM, Srinath C <[email protected]> wrote:

I have seen this behaviour too using 0.9.2-incubating. The failover works better when there is a redundant node available. Maybe 1 slot per node is the best approach. Eager to know if there are any steps to further diagnose.

On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]> wrote:

[Storm Version: 0.9.2-incubating]

Hello,

I am trying to test failover scenarios with my storm cluster. The following are the details of the cluster:

* 4 nodes
* Each node with 2 slots
* Topology with around 600 spouts and bolts
* Num. workers for the topology = 4

I am running a test that generates a constant load. The cluster handles this load fairly well, and CPU utilization stays below 50% on all the nodes. One slot is occupied on each of the nodes.

I then bring down one of the nodes (kill the supervisor and the worker processes on that node). After this, another worker is created on one of the remaining nodes, but the CPU utilization jumps to 100%. At this point, Nimbus cannot communicate with the supervisor on that node and keeps killing and restarting workers. The CPU utilization remains pegged at 100% as long as the load is on.

If I stop the test and restart it after a while, the same setup with just 3 nodes works perfectly fine with lower CPU utilization.

Any pointers on how to figure out the reason for the high CPU utilization during the failover?

Thanks
Vinay
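For reference, a minimal sketch of the kind of iptables-based partition injection described above. The hostnames are hypothetical and it assumes passwordless root SSH to each test node; a real harness like Jepsen does considerably more.

# Minimal sketch of Jepsen-style partition injection (hypothetical hostnames,
# assumes passwordless root SSH to each node). It only cuts and then restores
# traffic between two halves of a test cluster.
import subprocess
import time

NODES = ["storm-node-1", "storm-node-2", "storm-node-3", "storm-node-4"]  # hypothetical names

def ssh(host, command):
    # Run a shell command on a remote node over SSH.
    subprocess.check_call(["ssh", "root@" + host, command])

def partition(group_a, group_b):
    # Drop all traffic between the two groups by adding iptables DROP rules.
    for a in group_a:
        for b in group_b:
            ssh(a, "iptables -A INPUT -s {0} -j DROP".format(b))
            ssh(b, "iptables -A INPUT -s {0} -j DROP".format(a))

def heal(nodes):
    # Remove the injected rules (note: this flushes the whole INPUT chain).
    for node in nodes:
        ssh(node, "iptables -F INPUT")

if __name__ == "__main__":
    # Isolate one node from the other three, wait, then heal the partition,
    # watching how Nimbus and the supervisors behave in the meantime.
    partition(NODES[:1], NODES[1:])
    time.sleep(120)
    heal(NODES)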
