+1 for failure testing. We have used similar tools in the past to simulate situations like network cuts, high packet loss, etc. I would love to see more of this happen, and to see the scheduler get smart enough to detect these situations and deal with them.
- Bobby

From: "P. Taylor Goetz" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 5, 2014 at 8:15 PM
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: High CPU utilization after storm node failover

+ dev@storm

Vinay/Srinath,

Anything you can share to make this reproducible would be very helpful.

I would love to see a network partition simulation framework for Storm along the lines of what Kyle Kingsbury has done with Jepsen [1]. It basically sets up a virtual cluster and then simulates network partitions by manipulating iptables. Jepsen [2] is written in Clojure, of which Kyle is a strong proponent. I think it is worth a look.

-Taylor

[1] http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
[2] https://github.com/aphyr/jepsen

On Aug 5, 2014, at 8:39 PM, Srinath C <[email protected]> wrote:

I have seen this behaviour too using 0.9.2-incubating. The failover works better when there is a redundant node available. Maybe 1 slot per node is the best approach. Eager to know if there are any steps to further diagnose.

On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]> wrote:

[Storm Version: 0.9.2-incubating]

Hello,

I am trying to test failover scenarios with my storm cluster. The following are the details of the cluster:

* 4 nodes
* Each node with 2 slots
* Topology with around 600 spouts and bolts
* Num. workers for the topology = 4

I am running a test that generates a constant load. The cluster handles this load fairly well, and CPU utilization stays below 50% on all the nodes. One slot is occupied on each of the nodes.

I then bring down one of the nodes (kill the supervisor and the worker processes on that node). After this, another worker is created on one of the remaining nodes, but the CPU utilization jumps to 100%. At this point, Nimbus cannot communicate with the supervisor on that node and keeps killing and restarting workers. The CPU utilization remains pegged at 100% as long as the load is on.

If I stop the test and restart it after a while, the same setup with just 3 nodes works perfectly fine with lower CPU utilization.

Any pointers on how to figure out the reason for the high CPU utilization during the failover?

Thanks
Vinay
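For reference, a minimal sketch of the kind of iptables-based partition injection described above. The hostnames are hypothetical and it assumes passwordless root SSH to each test node; a real harness like Jepsen does considerably more.

# Minimal sketch of Jepsen-style partition injection (hypothetical hostnames,
# assumes passwordless root SSH to each node). It only cuts and then restores
# traffic between two halves of a test cluster.
import subprocess
import time

NODES = ["storm-node-1", "storm-node-2", "storm-node-3", "storm-node-4"]  # hypothetical names

def ssh(host, command):
    # Run a shell command on a remote node over SSH.
    subprocess.check_call(["ssh", "root@" + host, command])

def partition(group_a, group_b):
    # Drop all traffic between the two groups by adding iptables DROP rules.
    for a in group_a:
        for b in group_b:
            ssh(a, "iptables -A INPUT -s {0} -j DROP".format(b))
            ssh(b, "iptables -A INPUT -s {0} -j DROP".format(a))

def heal(nodes):
    # Remove the injected rules (note: this flushes the whole INPUT chain).
    for node in nodes:
        ssh(node, "iptables -F INPUT")

if __name__ == "__main__":
    # Isolate one node from the other three, wait, then heal the partition,
    # watching how Nimbus and the supervisors behave in the meantime.
    partition(NODES[:1], NODES[1:])
    time.sleep(120)
    heal(NODES)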
