Hey all,

We've been having unpredictable issues with our Flume master installation 
lately. Here's what happened to us last night and today:

YESTERDAY
Yesterday we added 8 new nodes to Flume. They were set up fine, and the configs 
were registered.
A few hours later the master completely stopped responding to anything 
(web/shell/nodes). I didn't find out until this morning.
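
A small probe along these lines would have caught the hang sooner. This is only a sketch: the function name master_responds is made up for illustration, and the URL in the usage note is an assumption that should be replaced with whatever address the master's web interface actually listens on.

```python
import http.client
import urllib.error
import urllib.request


def master_responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to url gets any HTTP response within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status, so it is not hung.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, host unreachable, timed out, or a broken response.
        return False
```

Run from cron, something like master_responds("http://localhost:35871/") could trigger an alert when it returns False. The port here is what I believe the master web UI uses by default, so treat it as an assumption and check your own config.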

TODAY
I tried to stop it using the init script. That did nothing: the master kept 
running but stayed unresponsive.
I ran kill -9 on the Flume processes and removed the pid file, figuring I could 
just start it again.

Now the master wouldn't start: "master already running on pid=<non-existent-pid>".
When I finally got it to start (by changing the pid directory), it became 
unresponsive again.
I restarted it, and it did the same.
I stopped all the flume-nodes and restarted it, and it looked good. Then I 
started the flume nodes, and it went unresponsive again.
I restarted it once more, and this time it worked.
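
For anyone who wants to check the "already running" message independently, here is a rough sketch of how to test whether the pid recorded in a pid file still belongs to a live process. It is not how the Flume init script does it (I haven't checked that); the function names are made up, and it assumes a POSIX system.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending a signal
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def read_pid(path: str):
    """Return the pid stored in the file at path, or None if missing or malformed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def is_stale(path: str) -> bool:
    """Return True if the pid file exists but names a process that is gone."""
    pid = read_pid(path)
    return pid is not None and not pid_is_alive(pid)
```

Calling is_stale() on the master's pid file (wherever your install keeps it) would show whether the file really points at a dead process, which is what the error message above suggests.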


The only log entry above INFO level that I can see is this:
2011-08-26 14:38:34,527 WARN com.cloudera.flume.agent.FlumeNode: Unable to load output format plugin class - Class not found


I don't think that warning is causing the issues, though.
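
In case it helps anyone repeat the search, this is roughly how the logs can be filtered for entries above INFO. It is a sketch that assumes every entry starts with a timestamp and level in the same layout as the WARN line above; the level names are the usual log4j ones, and above_info is just a name I picked.

```python
import re

# Usual log4j severity order; anything ranked higher than INFO is reported.
LEVELS = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4, "FATAL": 5}

# Matches a leading "YYYY-MM-DD HH:MM:SS,mmm LEVEL " prefix.
LINE_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+) ")


def above_info(lines):
    """Yield log lines whose level is more severe than INFO."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and LEVELS.get(m.group(1), -1) > LEVELS["INFO"]:
            yield line.rstrip("\n")
```

Passing an open log file to above_info() yields only the WARN, ERROR, and FATAL entries. Continuation lines such as stack traces are skipped because they don't start with a timestamp.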


I do have a flume-node running on the same machine as the master. Could there 
be some sort of race condition happening?
Has anyone else seen behavior like this?
Any idea how to fix it?

I'm hoping someone can shed some light on this. I'm really not sure what's going on.

Thanks all 

-- 
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matt...@foursquare.com (mailto:matt...@foursquare.com) | @rathboma 
(http://twitter.com/rathboma) | 4sq (http://foursquare.com/rathboma)

