I've made sure all of our flume machines are on the same version, but that 
doesn't seem to help. 

Whenever I start a new node, it takes down the master. This happens every 
single time. I have to restart everything in a very specific order to make sure 
it starts working again.

It's probably something to do with my use of the rpcSource, which causes issues 
in other areas too. 
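
For what it's worth, here is a minimal sketch (Python, POSIX only) of a check 
for the leftover pid files Mike mentions further down the thread. The two 
paths are the ones from his message; the helper names are made up for 
illustration, and none of this is part of flume itself.

```python
import os
from pathlib import Path

# Paths quoted from Mike's message in this thread; adjust for your install.
PID_FILES = ["/var/run/flume/flume-flume-master.pid", "/tmp/flumemaster.pid"]


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_pid_files(paths):
    """Return the pid files that exist but do not point at a live process."""
    stale = []
    for path in map(Path, paths):
        if not path.is_file():
            continue  # no pid file at this location, nothing to clean up
        text = path.read_text().strip()
        # Unparseable or non-positive contents are treated as stale too.
        if not text.isdigit() or int(text) <= 0 or not pid_is_running(int(text)):
            stale.append(str(path))
    return stale


if __name__ == "__main__":
    for path in stale_pid_files(PID_FILES):
        print("stale pid file:", path)
```

Run it once all the flume processes are stopped; anything it prints is a pid 
file that no longer matches a running process and is a candidate for removal 
before starting the master again.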

-- 
Matthew Rathbone
Foursquare | Software Engineer | Server Engineering Team
matt...@foursquare.com (mailto:matt...@foursquare.com) | @rathboma 
(http://twitter.com/rathboma) | 4sq (http://foursquare.com/rathboma)



On Sunday, August 28, 2011 at 9:36 AM, Bao Thai Ngo wrote:

> Mike,
> 
> I had the same problem with the flume master. Try removing flume and its 
> init script on the master machine, then re-install the flume master. Just 
> remember to save your configuration first.
> 
> Good luck.
> 
> ~Thai
> 
> On Fri, Aug 26, 2011 at 10:55 PM, Mike <mikethe...@gmail.com 
> (mailto:mikethe...@gmail.com)> wrote:
> > I'd also ensure that all nodes/masters/collectors/etc are using the
> >  precise same build of flume.
> > 
> >  On Fri, Aug 26, 2011 at 11:53 AM, Matthew Rathbone
> > <matt...@foursquare.com (mailto:matt...@foursquare.com)> wrote:
> > > Ah, I'm seeing this on single-master mode :-/. Anywhere else you think I
> > > could look for useful debugging output?
> > > --
> > > Matthew Rathbone
> > > Foursquare | Software Engineer | Server Engineering Team
> > > matt...@foursquare.com (mailto:matt...@foursquare.com) | @rathboma | 4sq
> > >
> > > On Friday, August 26, 2011 at 10:34 AM, Mike wrote:
> > >
> > > I did - but that was when we were testing multi-master mode, and since
> > > it's not fully matured yet, I've gone back to a single master.
> > >
> > > On Fri, Aug 26, 2011 at 11:32 AM, Matthew Rathbone
> > > <matt...@foursquare.com (mailto:matt...@foursquare.com)> wrote:
> > >
> > > You're right, there's another pid file there, that's crazy.
> > > Have you experienced the unresponsiveness thing too?
> > > --
> > > Matthew Rathbone
> > > Foursquare | Software Engineer | Server Engineering Team
> > > matt...@foursquare.com (mailto:matt...@foursquare.com) | @rathboma | 4sq
> > >
> > > On Friday, August 26, 2011 at 10:17 AM, Mike wrote:
> > >
> > > I recall a similar problem I had with this.
> > >
> > > It ended up being another pid-style file dropped somewhere else.
> > >
> > > /var/run/flume/flume-flume-master.pid
> > > /tmp/flumemaster.pid
> > >
> > > See if those are still around once all the flume procs are dead.
> > >
> > > -M
> > >
> > > On Fri, Aug 26, 2011 at 11:03 AM, Matthew Rathbone
> > > <matt...@foursquare.com (mailto:matt...@foursquare.com)> wrote:
> > >
> > > Hey all,
> > > > We're having totally unpredictable issues with the flume master
> > > > installation lately. Here's what happened to us last night / today:
> > > >
> > > > YESTERDAY
> > > > - Yesterday we added 8 new nodes to flume. They got set up fine, and the
> > > >   configs were registered.
> > > > - A few hours later the master totally stops responding to anything
> > > >   (web/shell/nodes). I don't find out until this morning.
> > > >
> > > > TODAY
> > > > - I try to stop it using the init script. That doesn't do anything, and it
> > > >   continues to run but stays unresponsive.
> > > > - I kill -9 the flume processes and remove the pid file, figuring I can
> > > >   just start it again.
> > > > - Now the master won't start: "master already running on
> > > >   pid=<non-existent-pid>".
> > > > - When I finally get it to start (by changing the pid directory), it starts
> > > >   being unresponsive again.
> > > > - Restart it, it does the same.
> > > > - Stop all flume-nodes, restart it, looks good. Start the flume nodes, it
> > > >   goes unresponsive again.
> > > > - Restart it, and this time it works.
> > >
> > > > The only log entry above INFO level that I can see is this:
> > > 2011-08-26 14:38:34,527 WARN com.cloudera.flume.agent.FlumeNode: Unable to
> > > load output format plugin class - Class not found
> > > but I don't think that's causing the issues.
> > >
> > > > I do have a flume-node running on the same machine, could there be some
> > > > sort of race condition happening?
> > > > Has anyone else seen behavior like this?
> > > > Any idea how to fix it?
> > > > Hoping someone can shed some light on this, I'm really not sure what's
> > > > going on.
> > > Thanks all
> > > --
> > > Matthew Rathbone
> > > Foursquare | Software Engineer | Server Engineering Team
> > > matt...@foursquare.com (mailto:matt...@foursquare.com) | @rathboma | 4sq
> > >
> > >
> 
