I'd also make sure that all nodes, masters, collectors, etc. are running exactly the same build of Flume.
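
For the leftover pid files Mike mentioned further down the thread, a rough Python sketch along these lines might help spot them once every Flume process has been stopped. The two paths are copied from his message and may differ on your install. The function names are made up for illustration, and the script assumes a POSIX system and pid files that hold nothing but a process id.

```python
import os

# Paths taken from Mike's message in this thread; adjust for your install.
PID_FILES = [
    "/var/run/flume/flume-flume-master.pid",
    "/tmp/flumemaster.pid",
]


def pid_is_running(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_pid_files(paths):
    """Return (path, contents) pairs for pid files with no live process behind them."""
    stale = []
    for path in paths:
        try:
            with open(path) as f:
                contents = f.read().strip()
        except FileNotFoundError:
            continue  # no file at this path, nothing to clean up
        try:
            pid = int(contents)
        except ValueError:
            pid = 0  # unparseable contents are treated as stale
        if pid <= 0 or not pid_is_running(pid):
            stale.append((path, contents))
    return stale


if __name__ == "__main__":
    for path, contents in find_stale_pid_files(PID_FILES):
        print(f"stale pid file: {path} (contents: {contents!r})")
```

It only reports the files and leaves deleting them to you, since removing a pid file that belongs to a live master would cause its own problems.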
On Fri, Aug 26, 2011 at 11:53 AM, Matthew Rathbone <matt...@foursquare.com> wrote:
> Ah, I'm seeing this on single-master mode :-/. Anywhere else you think I
> could look for useful debugging output?
> --
> Matthew Rathbone
> Foursquare | Software Engineer | Server Engineering Team
> matt...@foursquare.com | @rathboma | 4sq
>
> On Friday, August 26, 2011 at 10:34 AM, Mike wrote:
>
> I did - but that was when we were testing multi-master mode, and since
> it's not fully matured yet, I've gone back to a single master.
>
> On Fri, Aug 26, 2011 at 11:32 AM, Matthew Rathbone
> <matt...@foursquare.com> wrote:
>
> You're right, there's another pid file there, that's crazy.
> Have you experienced the unresponsiveness thing too?
> --
> Matthew Rathbone
> Foursquare | Software Engineer | Server Engineering Team
> matt...@foursquare.com | @rathboma | 4sq
>
> On Friday, August 26, 2011 at 10:17 AM, Mike wrote:
>
> I recall a similar problem I had with this.
>
> It ended up being another pid-style file dropped somewhere else.
>
> /var/run/flume/flume-flume-master.pid
> /tmp/flumemaster.pid
>
> See if those are still around once all the flume procs are dead.
>
> -M
>
> On Fri, Aug 26, 2011 at 11:03 AM, Matthew Rathbone
> <matt...@foursquare.com> wrote:
>
> Hey all,
> We're having totally unpredictable issues with the flume master installation
> lately, here's what happened to us last night / today:
>
> YESTERDAY
> Yesterday we added 8 new nodes to flume. They got set-up fine, and the
> configs were registered.
> a few hours later the master totally stops responding to anything
> (web/shell/nodes), I don't find out until this morning.
>
> TODAY
> I try to stop it using the init script, that doesn't do anything, and it
> continues to run, but be unresponsive
> I kill -9 the flume processes, and remove the pid file, figuring I can just
> start it again
> now the master won't start "master already running on
> pid=<non-existent-pid>"
> when I finally get it to start (changing the pid directory), it starts being
> unresponsive again
> restart it, it does the same
> stop all flume-nodes, restart it, looks good, start the flume nodes, it goes
> unresponsive again
> restart it, and this time it works
>
> The only log above an INFO statement that I can see is this:
> 2011-08-26 14:38:34,527 WARN com.cloudera.flume.agent.FlumeNode: Unable to load output format plugin class - Class not found
> but I don't think that's causing the issues.
>
> I do have a flume-node running on the same machine, could there be some sort
> of race condition happening?
> Has anyone else seen behavior like this?
> Any idea how to fix it?
> Hoping someone can shed some light on this, I'm really not sure what's going
> on.
> Thanks all
> --
> Matthew Rathbone
> Foursquare | Software Engineer | Server Engineering Team
> matt...@foursquare.com | @rathboma | 4sq