Re: [Pacemaker] Occasional nonsensical resource agent errors, redux

Andrei Borzenkov Mon, 03 Nov 2014 08:14:45 -0800

On Mon, 3 Nov 2014 15:26:34 +0100
Dejan Muhamedagic <deja...@fastmail.fm> wrote:


> Hi,
> 
> On Mon, Nov 03, 2014 at 08:46:00AM +0300, Andrei Borzenkov wrote:
> > On Mon, 3 Nov 2014 13:32:45 +1100
> > Andrew Beekhof <and...@beekhof.net> wrote:
> > 
> > > 
> > > > On 1 Nov 2014, at 11:03 pm, Patrick Kane <p...@wawd.com> wrote:
> > > > 
> > > > Hi all:
> > > > 
> > > > In July, list member Ken Gaillot reported occasional nonsensical 
> > > > resource agent errors using Pacemaker 
> > > > (http://oss.clusterlabs.org/pipermail/pacemaker/2014-July/022231.html).
> > > > 
> > > > We're seeing similar issues with our install.  We have a 2 node 
> > > > corosync/pacemaker failover configuration that is using the 
> > > > ocf:heartbeat:IPaddr2 resource agent extensively.  About once a week, 
> > > > we'll get an error like this, out of the blue:
> > > > 
> > > >   Nov  1 05:23:57 lb02 IPaddr2(anon_ip)[32312]: ERROR: Setup problem: couldn't find command: ip
> > > > 
> > > > It goes without saying that the ip command hasn't gone anywhere and all 
> > > > the paths are configured correctly.
> > > > 
> > > > We're currently running 1.1.10-14.el6_5.3-368c726 under CentOS 6 x86_64 
> > > > inside a Xen container.
> > > > 
> > > > Any thoughts from folks on what might be happening or how we can get 
> > > > additional debug information to help figure out what's triggering this?
> > > 
> > > It's pretty much in the hands of the agent.
> > 
> > Actually, the message seems to be output by the check_binary() function,
> > which is part of the framework.
> 
> Someone complained on IRC about this issue (another resource
> agent though, I think Xen) and they said that which(1) was not
> able to find the program. I'd suggest running strace (or ltrace)
> on which(1) at that point (the call is in ocf-shellfuncs).
> 
> The which(1) utility is a simple tool: it splits the PATH
> environment variable and stats the program name appended to each
> of the paths. Is PATH somehow corrupted, or is the filesystem
> misbehaving? My guess is that it's the former.
> 
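For reference, the lookup described above can be approximated in a few
lines of shell. This is only a sketch of the general technique, not the
code of any particular which(1) implementation; find_in_path is a
hypothetical name and is not part of ocf-shellfuncs.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocf-shellfuncs): walk PATH the way
# which(1) is described above and print the first executable match.
find_in_path() {
    prog=$1
    old_ifs=$IFS
    IFS=:
    # $PATH is split on ":" once, when the for loop is set up.
    for dir in $PATH; do
        IFS=$old_ifs
        # An empty PATH element means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$prog" ] && [ -x "$dir/$prog" ]; then
            printf '%s\n' "$dir/$prog"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

Something like `find_in_path ip || echo "not found, PATH=$PATH"` at the
failing call site would show whether a plain PATH walk also misses the
binary. It uses only shell builtins, so it keeps working even if PATH
itself is broken.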

As it is called quite often, I'd instrument have_binary to dump the full
environment and all shell variables when "which" fails for a binary that is
known to exist, and to rerun the lookup under strace at that point. Running
it under strace every time would probably produce too much output.
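
A minimal sketch of that kind of instrumentation, under some assumptions:
debug_have_binary and HB_DEBUG_DIR are names made up for illustration, the
log location defaults to /tmp, and this is a wrapper around which(1), not
the actual have_binary code from ocf-shellfuncs.

```shell
#!/bin/sh
# Hypothetical wrapper; the function name, HB_DEBUG_DIR and the layout of
# the dump are assumptions, not part of ocf-shellfuncs.
debug_have_binary() {
    bin=$1
    if path=$(which "$bin" 2>/dev/null) && [ -x "$path" ]; then
        return 0
    fi
    # Failure path only: record what the shell looked like at this moment.
    log="${HB_DEBUG_DIR:-/tmp}/have_binary.$bin.$$.log"
    {
        echo "which $bin failed at $(date)"
        echo "PATH=$PATH"
        echo "--- environment ---"
        env
        echo "--- shell variables ---"
        set
    } >"$log" 2>&1
    # Rerun the lookup under strace, if it can be found, to capture the
    # system calls and their errno values. Its exit status is ignored.
    if command -v strace >/dev/null 2>&1; then
        strace -f -o "$log.strace" which "$bin" >/dev/null 2>&1 || :
    fi
    return 1
}
```

If PATH really is corrupted at that point, date, env and strace may not be
found either; echo and set are builtins, so at least PATH and the shell
variables should still end up in the log.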

> BTW, was there an upgrade of some kind before this started
> happening?
> 
> Thanks,
> 
> Dejan
> 
> > > You could perhaps find the call that looks for ip and wrap it in a
> > > set -x/set +x block. That way you'd know exactly why it thinks the
> > > binary is missing.
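
For illustration, a set -x/set +x block around a lookup could look roughly
like this. lookup_traced is a hypothetical stand-in; the real call site is
whatever line in the agent or in ocf-shellfuncs checks for ip.

```shell
#!/bin/sh
# Hypothetical stand-in for the lookup inside the agent.
lookup_traced() {
    set -x                      # start echoing each command to stderr
    which "$1" >/dev/null 2>&1
    rc=$?
    set +x                      # stop tracing
    return $rc
}
```

The trace goes to the shell's stderr, so for a resource agent it should
land wherever the agent's stderr is logged.
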
> > > _______________________________________________
> > > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > 
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org

