Hey all,

We're having totally unpredictable issues with the Flume master installation lately. Here's what happened to us last night and today:

YESTERDAY

Yesterday we added 8 new nodes to Flume. They got set up fine, and the configs were registered. A few hours later the master totally stops responding to anything (web/shell/nodes). I don't find out until this morning.

TODAY

- I try to stop it using the init script. That doesn't do anything: it continues to run, but stays unresponsive.
- I kill -9 the flume processes and remove the pid file, figuring I can just start it again.
- Now the master won't start: "master already running on pid=<non-existent-pid>".
- When I finally get it to start (by changing the pid directory), it becomes unresponsive again.
- Restart it, and it does the same.
- Stop all flume-nodes, restart it, and it looks good. Start the flume nodes, and it goes unresponsive again.
- Restart it, and this time it works.

The only log entry above INFO level that I can see is this:

2011-08-26 14:38:34,527 WARN com.cloudera.flume.agent.FlumeNode: Unable to load output format plugin class - Class not found

but I don't think that's causing the issues.

I do have a flume-node running on the same machine as the master. Could there be some sort of race condition happening?

Has anyone else seen behavior like this? Any idea how to fix it? Hoping someone can shed some light on this, as I'm really not sure what's going on.

Thanks all

--
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matt...@foursquare.com | @rathboma (http://twitter.com/rathboma) | 4sq (http://foursquare.com/rathboma)