+1 for failure testing.  We have used similar tools in the past to simulate 
situations like network cuts, high packet loss, etc.  I would love to see more 
of this happen, and to see the scheduler get smart enough to detect these 
situations and deal with them.
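
Something along these lines, driven from a test harness, is roughly what I 
mean (the interface name and loss rate below are made up, and the commands 
need root):

import java.io.IOException;

// Rough sketch only: inject packet loss on a node's interface via tc/netem.
public class PacketLossInjector {

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + java.util.Arrays.toString(cmd));
        }
    }

    public static void main(String[] args) throws Exception {
        // add 20% packet loss on eth0
        run("tc", "qdisc", "add", "dev", "eth0", "root", "netem", "loss", "20%");
        Thread.sleep(60000); // let the topology run degraded for a minute
        // remove the netem qdisc to restore normal networking
        run("tc", "qdisc", "del", "dev", "eth0", "root", "netem");
    }
}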

- Bobby

From: "P. Taylor Goetz" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 5, 2014 at 8:15 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: High CPU utilization after storm node failover

+ dev@storm

Vinay/Srinath,

Anything you can share to make this reproducible would be very helpful.

I would love to see a network partition simulation framework for Storm along 
the lines of what Kyle Kingsbury has done with Jepsen [1]. It basically sets up 
a virtual cluster and then simulates network partitions by manipulating iptables.
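
As a rough sketch, that kind of partition could be driven from a test harness 
with something like this (the node IP is made up, and the iptables rules need 
root):

import java.io.IOException;

// Illustrative only: isolate one Storm node by dropping its traffic with
// iptables, then heal the partition.
public class PartitionSimulator {

    private static void iptables(String action, String nodeIp) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("iptables", action, "INPUT", "-s", nodeIp, "-j", "DROP")
                .inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("iptables " + action + " failed for " + nodeIp);
        }
    }

    public static void main(String[] args) throws Exception {
        String node = "10.0.0.12"; // hypothetical supervisor node
        iptables("-A", node);      // start dropping packets from that node
        Thread.sleep(30000);       // observe how nimbus and the supervisors react
        iptables("-D", node);      // delete the rule, healing the partition
    }
}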

Jepsen [2] is written in Clojure, of which Kyle is a strong proponent.

I think it is worth a look.

-Taylor

[1] 
http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
[2] https://github.com/aphyr/jepsen

On Aug 5, 2014, at 8:39 PM, Srinath C <[email protected]> wrote:

I have seen this behaviour too with 0.9.2-incubating.
Failover works better when there is a redundant node available; maybe 1 slot 
per node is the best approach.
Eager to know if there are any steps to diagnose this further.




On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]> wrote:
[Storm Version: 0.9.2-incubating]

Hello,

I am trying to test failover scenarios with my Storm cluster. The following are 
the details of the cluster:

* 4 nodes
* Each node with 2 slots
* Topology with around 600 spouts and bolts
* Num. Workers for the topology = 4

I am running a test that generates a constant load. The cluster handles this 
load fairly well, and the CPU utilization at this point is below 50% on all the 
nodes. One slot is occupied on each of the nodes.
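
For reference, the worker count is set the standard way; everything below 
other than setNumWorkers(4) is a placeholder rather than the real topology 
(the 2 slots per node come from supervisor.slots.ports in storm.yaml):

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.TopologyBuilder;

public class FailoverTestTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("test-spout", new TestWordSpout(), 10);
        // ... the real topology registers ~600 spouts and bolts here ...

        Config conf = new Config();
        conf.setNumWorkers(4); // one worker per node; the second slot on each node stays free

        StormSubmitter.submitTopology("failover-test", conf, builder.createTopology());
    }
}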

I then bring down one of the nodes (kill the supervisor and worker processes 
on that node). After this, another worker is created on one of the remaining 
nodes, but the CPU utilization jumps to 100%. At this point, Nimbus cannot 
communicate with the supervisor on that node and keeps killing and restarting 
workers.

The CPU utilization remains pegged at 100% as long as the load is applied. If I 
stop the test and restart it after a while, the same setup with just 3 nodes 
works perfectly fine with lower CPU utilization.

Any pointers on how to figure out the reason for the high CPU utilization 
during the failover?

Thanks
Vinay


