On Mon, May 10, 2010 at 11:05 AM, Dejan Muhamedagic <deja...@fastmail.fm> wrote:
> Hi,
>
> On Fri, May 07, 2010 at 07:08:37PM +0200, Lars Ellenberg wrote:
>> On Fri, May 07, 2010 at 03:03:39PM +0200, Dejan Muhamedagic wrote:
>> > Hi,
>> >
>> > On Fri, May 07, 2010 at 12:35:59PM +0200, Fabian Ruff wrote:
>> > > Hi,
>> > >
>> > > I'm currently testing a 2-node HA-Firewall with pacemaker+corosync
>> > > on Debian Lenny.
>> > > I used the latest package from the madkiss repo for the setup
>> > > (corosync 1.2.0, pacemaker 1.0.8).
>> > >
>> > > I will spare you all the verbose config for now and just give you an
>> > > overview of the resource configuration:
>> > >
>> > > >gwa:~# crm_mon -1
>> > > >============
>> > > >Last updated: Fri May 7 12:10:19 2010
>> > > >Stack: openais
>> > > >Current DC: gwb - partition with quorum
>> > > >Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
>> > > >2 Nodes configured, 2 expected votes
>> > > >6 Resources configured.
>> > > >============
>> > > >
>> > > >Online: [ gwa gwb ]
>> > > >
>> > > > Master/Slave Set: drbd_disk
>> > > >     Masters: [ gwa ]
>> > > >     Slaves: [ gwb ]
>> > > > Clone Set: connectivity
>> > > >     Started: [ gwb gwa ]
>> > > > fencing_gwa (stonith:external/ipmi): Started gwb
>> > > > fencing_gwb (stonith:external/ipmi): Started gwa
>> > > > Resource Group: ips
>> > > >     ip_outside (ocf::heartbeat:IPaddr2): Started gwa
>> > > >     ip_backup (ocf::heartbeat:IPaddr2): Started gwa
>> > > >     ip_secure (ocf::heartbeat:IPaddr2): Started gwa
>> > > >     ip_inside (ocf::heartbeat:IPaddr2): Started gwa
>> > > >     ip_staging (ocf::heartbeat:IPaddr2): Started gwa
>> > > >     firewall (lsb:firewall): Started gwa
>> > > > Resource Group: services
>> > > >     filesystem (ocf::heartbeat:Filesystem): Started gwa
>> > > >     openvpn (lsb:openvpn-cluster): Started gwa
>> > > >     dnsmasq (lsb:dnsmasq): Started gwa
>> > >
>> > >
>> > > The cluster was running fairly stable for the past couple of weeks.
>> > >
>> > > But then yesterday without any user interaction and while idle the
>> > > active node (gwa) failed and was subsequently stonithed by the
>> > > passive one (gwb) due to a strange error (at least to me) on almost
>> > > all resource agents:
>> > >
>> > > >gwa:~# grep -i error /var/log/syslog-20100507
>> > > >May 6 14:13:23 gwa lrmd: [27931]: ERROR: (raexecocf.c:execra:178)
>> > > >execl failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument
>> > > >list too long
>>
>> man execve:
>>   E2BIG  The total number of bytes in the environment (envp) and
>>          argument list (argv) is too large.
>>
>> line (raexecocf.c:execra:178) is
>>   execl(ra_pathname, ra_pathname, op_type, (const char *)NULL);
>>
>> so it is NOT the argument list, even though perror seems to
>> think that's the more likely cause for this error.
>> unless "op_type" happens to be an unterminated multi kB string somehow.
>> (we know what ra_pathname is from the perror message).
>>
>> Does lrmd accumulate setenv() somehow?
>
> No, I don't think so. The set of environment variables is limited
> to what is provided by the client in the message.
>
>> Or crmd sent too many parameters?
>
> If that's the case, then there is perhaps memory corruption in
> crmd. Or the messaging layer. One unusual thing about the
> configuration is that, obviously by accident, the start operation
> on two fencing resources had non-zero interval.

Just in the config or did it make it into the lrmd like that?
Because crmd/lrm.c has:

    if(op->interval != 0) {
        if(safe_str_eq(operation, CRMD_ACTION_START)
           || safe_str_eq(operation, CRMD_ACTION_STOP)) {
            crm_err("Start and Stop actions cannot have an interval: %d", op->interval);
            op->interval = 0;
        }
    }

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
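
For anyone who wants to convince themselves that an oversized environment alone produces "Argument list too long", here is a minimal standalone sketch. It is not lrmd code: the helper names vector_bytes() and try_exec_with_env() are made up for illustration, /bin/sh is assumed to exist, and the exact size at which the kernel gives up depends on the platform and the stack rlimit.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: bytes of string data (including each NUL) in a
 * NULL-terminated argv/envp vector. Kernels may also count the pointers. */
size_t vector_bytes(char *const vec[])
{
    size_t total = 0;
    for (size_t i = 0; vec != NULL && vec[i] != NULL; i++)
        total += strlen(vec[i]) + 1;
    return total;
}

/* Hypothetical helper: exec "/bin/sh -c :" in a child with `count`
 * environment entries of `len` bytes each (including the NUL).
 * Returns 0 if the exec succeeded, the errno from execve() if it
 * failed, or -1 if the test itself could not be set up. */
int try_exec_with_env(size_t count, size_t len)
{
    if (len < 3)
        return -1;                      /* need room for "V=" plus NUL */

    char *entry = malloc(len);
    char **envp = calloc(count + 1, sizeof *envp);
    int fds[2];
    if (entry == NULL || envp == NULL || pipe(fds) != 0) {
        free(entry);
        free(envp);
        return -1;
    }

    memset(entry, 'a', len - 1);
    entry[0] = 'V';
    entry[1] = '=';
    entry[len - 1] = '\0';
    for (size_t i = 0; i < count; i++)
        envp[i] = entry;                /* the kernel copies every entry */

    /* The write end closes on a successful exec, so the parent reads EOF. */
    fcntl(fds[1], F_SETFD, FD_CLOEXEC);

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        char *const argv[] = { "sh", "-c", ":", NULL };
        close(fds[0]);
        execve("/bin/sh", argv, envp);
        int err = errno;                /* only reached if execve failed */
        if (write(fds[1], &err, sizeof err) < 0)
            _exit(126);
        _exit(127);
    }
    close(fds[1]);
    if (pid > 0) {
        int err = 0;
        ssize_t n = read(fds[0], &err, sizeof err);
        if (n == (ssize_t)sizeof err)
            result = err;               /* child reported execve's errno */
        else if (n == 0)
            result = 0;                 /* EOF: the exec went through */
        waitpid(pid, NULL, 0);
    }
    close(fds[0]);
    free(envp);
    free(entry);
    return result;
}
```

Calling try_exec_with_env(1, 16) should return 0, while something like try_exec_with_env(4096, 65536) (roughly 256 MiB of environment) should come back with E2BIG on a typical Linux box, even though the argument list is only three short strings. That matches the reading above that the environment, not argv, is the suspect here.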