Re: [Linux-HA] odd cluster failure

2017-02-09 Thread Greg Woods
On Thu, Feb 9, 2017 at 2:28 AM, Ferenc Wágner  wrote:

> Looks like your VM resource was destroyed (maybe due to the xen balloon
> errors above), and the monitor operation noticed this.
>

Thank you for helping me interpret that. I think what happened is that the
VM in question (radnets) is the only one that did not have maxmem specified
in the config file. It probably suffered memory pressure and the hypervisor
tried to give it more memory, but ballooning is turned off in the
hypervisor. That's probably where the balloon errors come from. The VM
probably got hung up because it ran out of memory, causing the monitor to
fail.

There is a little guesswork going on here, because I do not fully
understand how Xen ballooning works (or is supposed to work), but it seems
like I should set maxmem for this VM like all the others, and I increased
its available memory as well. Now I will just wait and see if it happens again.
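
For the record, the change amounts to something like this in the domU config
file (path and numbers are illustrative, not the actual values for radnets):

# /etc/xen/radnets -- memory settings only, everything else omitted
memory = 2048    # RAM the guest gets at boot, in MiB
maxmem = 2048    # hard ceiling; with maxmem equal to memory the balloon has no room to grow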

--Greg
___
Linux-HA mailing list is closing down.
Please subscribe to us...@clusterlabs.org instead.
http://clusterlabs.org/mailman/listinfo/users
___
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha

[Linux-HA] odd cluster failure

2017-02-03 Thread Greg Woods
For the second time in a few weeks, we have had one node of a particular
cluster getting fenced. It isn't totally clear why this is happening. On
the surviving node I see:

Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence
(reboot) vmc2.ucar.edu: static-list
Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence
(reboot) vmc2.ucar.edu: static-list
Feb  2 16:49:00 vmc1 kernel: igb :03:00.1 eth3: igb: eth3 NIC Link is
Down
Feb  2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
Feb  2 16:49:01 vmc1 corosync[2846]:   [TOTEM ] A processor failed, forming
new configuration.

OK, so from this point of view, it looks like the link was lost between the
two hosts, resulting in fencing. The link is a crossover cable, so no
networking hardware is involved other than the host NICs and the cable.

On the other side I see:

Feb  2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:49 vmc2 crmd[4191]:   notice: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts
for: fail-count-VM-radnets (1)
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 37:
fail-count-VM-radnets=1
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts
for: last-failure-VM-radnets (1486079209)
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 39:
last-failure-VM-radnets=1486079209
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: On loss of CCM Quorum: Ignore
Feb  2 16:46:50 vmc2 pengine[4190]:  warning: Processing failed op monitor
for VM-radnets on vmc2.ucar.edu: not running (7)
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Recover
VM-radnets#011(Started vmc2.ucar.edu)
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Calculated Transition 2914:
/var/lib/pacemaker/pengine/pe-input-317.bz2
Feb  2 16:46:50 vmc2 crmd[4191]:   notice: Initiating action 15: stop
VM-radnets_stop_0 on vmc2.ucar.edu (local)
Feb  2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will
be stopped (timeout: 80s)
Feb  2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
Feb  2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is
not ready
Feb  2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10,
protocol 1 (x86_64-abi)
Feb  2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
Feb  2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link
becomes ready
Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding
state
Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding
state
Feb  2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding
state

 (and then there are a bunch of null bytes, and the log resumes with reboot)

There are more messages about networking, but note that xenbr1 is not the
bridge device associated with the NIC in question.

I don't see any reason why the link between the hosts should suddenly stop
working, so I am suspecting a hardware problem that only crops up rarely
(but will most likely get worse over time).
Is there anything anyone can see in the log that would suggest otherwise?

Thank you,
--Greg
___
Linux-HA mailing list is closing down.
Please subscribe to us...@clusterlabs.org instead.
http://clusterlabs.org/mailman/listinfo/users
___
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha


[Linux-HA] Corosync 1 -> 2

2014-10-01 Thread Greg Woods
I notice that the network:ha-clustering:Stable repo for CentOS 6 now
contains Corosync 2.3.3-1. I am currently running 1.4.1-17. Is it safe to
just run this update? Are there configuration changes I have to make in
order for the new version to work? (If there is a document or wiki page
describing how to convert from Corosync 1 to 2, I would be happy to be
pointed to it).

Thanks,
--Greg
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Corosync 1 -> 2

2014-10-01 Thread Greg Woods
On Wed, Oct 1, 2014 at 8:44 AM, Digimer li...@alteeve.ca wrote:



> Personally, I would not upgrade. If you do, you will want to test outside
> of production first.

Of course, I would always do that anyway, even without a major version
number change.


> Corosync needed cman to be a quorum provider in the 1.x series. In 2.x, it
> became its own quorum provider and cman was no longer needed. Last I heard
> upstream, pacemaker on EL6 is only supported on corosync 1.4 + cman.

There is a pacemaker update too, to 1.1.12+git20140723.483f48a-1.1

> I'm sure you're not concerned about paid support, but it does mean that the
> corosync 1.4 stack is much better tested on EL6 than 2.x is.


OK, thanks. What I am really trying to figure out is exactly what the
 network_ha-clustering_Stable repo is for. Presumably, from the name, they
wouldn't put anything in there that isn't ready for production?

--Greg
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Corosync 1 -> 2

2014-10-01 Thread Greg Woods
On Wed, Oct 1, 2014 at 2:04 PM, Digimer li...@alteeve.ca wrote:



> Who runs the repo? It's not a name I am familiar with.


It comes from opensuse.org. I'm pretty sure I got it out of one of the
documents on the clusterlabs site, but I would have to go back and verify
that to be certain.

--Greg
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Multiple colocation with same resource group

2014-02-21 Thread Greg Woods
On Fri, 2014-02-21 at 12:37 +, Tony Stocker wrote:

  colocation inf_ftpd inf: infra_group ftpd
 
 or do I need to use an 'order' statement instead, i.e.:
 
  order ftp_infra mandatory: infra_group:start ftpd

I'm far from a leading expert on this, but in my experience, colocation
and order are completely separate concepts. If you want both, you have
to state both. So I would say you need both colocation and order
statements to get what you want. I have similar scenarios, where virtual
machines depend on the underlying DRBD device, so I colocate the DRBD
master-slave resource and the VM resource, then have an order statement
to say the DRBD resource must be started before the VM.
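
In crm shell syntax that combination looks roughly like this (resource names
are invented for illustration):

colocation col_vm_on_drbd inf: VM-example ms-drbd-example:Master
order ord_drbd_before_vm inf: ms-drbd-example:promote VM-example:start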

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd disks in secondary/secondary diskless/diskless mode

2013-08-14 Thread Greg Woods
On 08/14/2013 02:12 PM, Fredrik Hudner wrote:

 I have tried to make one node primary but only get:

 0: State change failed: (-2) Need access to UpToDate data
 Command 'drbdsetup primary 0' terminated with exit code 17

When you've suffered a sudden disconnect, you can get into a situation
where both sides think their information is outdated. To recover, you
have to tell the cluster which node can throw away its data in favor of
what the other node has.

http://www.drbd.org/users-guide/s-resolve-split-brain.html
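
The recovery described on that page boils down to roughly the following
(resource name is illustrative, and the exact flag placement differs between
DRBD 8.3 and 8.4):

# on the node whose changes you are willing to throw away (the split-brain victim):
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0    # DRBD 8.3; on 8.4: drbdadm connect --discard-my-data r0

# on the surviving node, if it is sitting in StandAlone state:
drbdadm connect r0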

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
I have two clusters that are both running CentOS 5.6 and
heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running
slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the
first one and pacemaker-1.0.12-1.el5 on the other). They both have
identical ha.cf files except that the bcast device names are different
(and they are correct for each case, I checked), like this:

udpport 694
bcast eth2
bcast eth1
use_logd off
logfile /var/log/halog
debugfile /var/log/hadebug
debug 1
keepalive 2
deadtime 15
initdead 60
node vmd1.ucar.edu
node vmd2.ucar.edu
auto_failback off
respawn hacluster /usr/lib64/heartbeat/ipfail
crm respawn

On one of them (which maybe or maybe not coincidentally is having some
problems), I get these messages logged about every 2 seconds
in /var/log/halog, on the other I don't see them:

May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping
message with 10 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] :
[t=NS_ackmsg]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] :
[dest=vmx2.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] :
[ackseq=3a0]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] :
[(1)destuuid=0x5ceb280(37 28)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] :
[src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] :
[(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] :
[hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] :
[ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] : [ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] : [auth=1
23b556bcb61a08abecf87cb6411c62e62cf99f0d]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG: Dumping
message with 12 fields
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[0] :
[t=status]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[1] :
[st=active]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[2] :
[dt=3a98]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[3] :
[protocol=1]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] :
[src=vmx1.ucar.edu]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] :
[(1)srcuuid=0x5ceb390(36 27)]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] :
[seq=17b]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] :
[hg=4c97c17a]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] :
[ts=51a13435]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] :
[ld=0.27 0.41 0.26 1/315 19183]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] :
[ttl=3]
May 25 15:59:17 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] :
[auth=1 3d3da4df831636f7c274395041ffb49bbf215170]

The questions are what do these messages actually mean, why is one
cluster logging them and not the other, and is this something I should
be worried about?

Thanks for any info,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
I know it's tacky to reply to myself, but I can answer one of my
questions after another 15 minutes or so of poring through logs:

On Tue, 2013-05-28 at 10:37 -0600, Greg Woods wrote:

 
 The questions are what do these messages actually mean, why is one
 cluster logging them and not the other, and is this something I should
 be worried about?

The answer to the last one is that this is definitely a problem, because
after nearly half an hour, this is logged:

May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[4] :
[src=vmx1.ucar.edu]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[5] :
[(1)srcuuid=0x5ceb390(36 27)]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[6] :
[seq=3a4]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[7] :
[hg=4c97c17a]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[8] :
[ts=51a13888]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[9] :
[ld=0.50 0.33 0.28 3/316 13859]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[10] :
[ttl=3]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: MSG[11] :
[auth=1 feb94da356847a538290ea75f27423c996c0a595]
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5689]: ERROR: write_child:
Exiting due to persistent errors: No such device
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBWRITE
process 5689 exited with return code 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: HBWRITE process
died.  Beginning communications restart process for comm channel 1.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: WARN: Managed HBREAD
process 5690 killed by signal 9 [SIGKILL - Kill, unblockable].
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: ERROR: Both comm
processes for channel 1 have died.  Restarting.
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP
Broadcast heartbeat started on port 694 (694) interface eth4
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: glib: UDP
Broadcast heartbeat closed on port 694 interface eth4 - Status: 1
May 25 16:17:44 vmx1.ucar.edu heartbeat: [5683]: info: Communications
restart succeeded.
May 25 16:17:45 vmx1.ucar.edu heartbeat: [5683]: info: Link
vmx2.ucar.edu:eth4 up.

And VMs stop being reachable, etc. The only way to stabilize things is
to not start heartbeat on one of the nodes (vmx1 arbitrarily chosen) and
run all resources on a single node (vmx2 in this case).

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote:

  respawn hacluster /usr/lib64/heartbeat/ipfail
  crm respawn
 
 I don't know about the rest, but definitely do not use both ipfail and crm.
 Pick one :)

I guess I will have to look into what ipfail really does. I have a half
dozen clusters that have virtually the same ha.cf files and they have
been running for 2+ years with it specified this way.
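
Concretely, following Andrew's advice the change is a one-line edit to ha.cf
(lines as in the config posted earlier; treat as illustrative):

#respawn hacluster /usr/lib64/heartbeat/ipfail   # drop ipfail when Pacemaker is in charge
crm respawn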

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: vm live migration without shared storage

2013-05-24 Thread Greg Woods
On Fri, 2013-05-24 at 10:45 +0200, Ulrich Windl wrote:

 
 You are still mixing total migration time (which may be minutes) with virtual 
 stand-still time (which is a few seconds).


Correct. It was not clear (to me) that when the time to migrate was
several minutes, the actual service outage was only a few seconds. This
point has now been made (several times), and it is a big difference.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] vm live migration without shared storage

2013-05-23 Thread Greg Woods
On Thu, 2013-05-23 at 15:00 -0400, David Vossel wrote:
  Migration time, depending on network speed and hardware, is much longer than 
 the shared storage option (minutes vs. seconds).  


This is just one data point (of course), but for the vast majority of
services that I run, if the live migration time is as long as it takes
to shut down a VM and boot it on another server, then there isn't much
of an advantage to doing the live migration. Especially if we're talking
about an option that is a long way from being battle-tested, and
critical services such as DNS and authentication. Most of these critical
services do not use long-lived connections.

I can see a few VMs that exist to provide ssh logins where a
minutes-long live migration would be clearly preferable to a shut down
and reboot, but in most cases, if it's as slow as rebooting, it isn't
going to be any advantage to me.

It will be interesting though to see how many applications people come
up with where a minutes-long live migration is preferable to shutdown
and reboot.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: DRBD NetworkFailure

2013-04-24 Thread Greg Woods
On Wed, 2013-04-24 at 08:48 +0200, Ulrich Windl wrote:
  Greg Woods wo...@ucar.edu wrote on 23.04.2013 at 21:20 in message

  Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
  Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
  Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure ->
  Unconnected ) 

 
 You could use ethtool to check the interface statistics; otherwise I'd vote
 for a software issue...

Ethtool doesn't show any errors, but it's possible that the errors don't
start occurring until just before DRBD detects the issue. Unfortunately
I can't access the system once the problems start occurring so I can't
run ethtool at that point.
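
For reference, the sort of check meant here is something like this (interface
name is illustrative; -S dumps the driver's per-NIC statistics counters):

ethtool -S eth3 | egrep -i 'err|drop|crc|fifo'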

If it's a software issue, what is it likely to be? I have to find some
way to debug this, I'm getting some flak about the outages this is
causing, even though, so far, they have been three weeks apart. And it
won't be long before this happens at 3AM, which will really suck.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] drbd error message decoding help

2013-04-24 Thread Greg Woods
On Wed, 2013-04-24 at 12:11 +0200, Lars Ellenberg wrote:

  
  drbd[25887]:2013/04/19_17:02:07 DEBUG: vmgroup2:
  Calling /usr/sbin/crm_master -Q -l reboot -v 1

I apologize for the noise about this. Further checks of the logs on all
my clusters show that this is normal behavior. I started a different
thread DRBD NetworkFailure which is hopefully closer to the mark.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] clean shutdown procedure?

2013-04-23 Thread Greg Woods
On Mon, 2013-04-22 at 09:50 -0600, Greg Woods wrote:
 On Mon, 2013-04-22 at 10:12 +1000, Andrew Beekhof wrote:
  On Saturday, April 20, 2013, Greg Woods wrote:
   Often one of the
   nodes gets stuck at Stopping HA Services
  
  
  That means pacemaker is waiting for one of your resources to stop.
  Do you have anything that would take a long time (or fail to stop)?
 
 Not that I am aware of. But some things that came up during this
 weekend's powerdown make me think that some of the stop actions are
 failing

This particular issue has been solved. It turns out that this is one of
those perfect storm situations. Because of the coming powerdown, our
HPSS (High Performance Storage System) was shut down several hours prior
to the HA clusters going down. The HA clusters do not directly depend on
the HPSS, but they do run backups to it. The incremental backup script
works by taking an LVM snapshot of the logical volume that the file
system containing the virtual machine images is mounted on, then
mounting the virtual disk images from the snapshot, and finally, running
our standard system backup script on the mounted images. The system
backup script will normally run a find on the file system(s) to be
backed up, and package it up into multiple cpio archives (as many as it
takes for either the full file system or just the files that have
changed in the past two days). Once an archive file has been created, it
gets sent to the HPSS. It turns out that the script will try multiple
times to send the file if the first attempt fails, which can actually
cause it to continue running and retrying for many hours. While it is
running, the snapshot is still in place. The cluster resource stop
failed on one of the LVM resources, saying that the volume group could
not be deactivated because there was still an active logical volume. The
snapshot. So that caused the fence. 

This still doesn't fully explain the original issue of why the shutdown
process can hang trying to stop the heartbeat service. Or does it? Since
I wasn't looking for this, I can't be certain that the HPSS wasn't
offline during the times I have observed these hangs, so I'll have to
start checking for that. In the meantime, I'll have to create a shutdown
script that checks for a hung backup, kills it, and deletes the snapshot
before issuing the /sbin/shutdown command.
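
A rough sketch of what that wrapper might look like (the backup script name,
volume group, and snapshot names here are hypothetical placeholders, not our
real ones):

#!/bin/bash
# pre-shutdown cleanup: kill a hung backup, drop its LVM snapshot, then shut down
pkill -f incremental-backup.sh || true     # hypothetical backup script name
umount /mnt/backup-snap 2>/dev/null        # hypothetical snapshot mount point
lvremove -f /dev/vmvg/backup-snap          # hypothetical VG/snapshot LV
exec /sbin/shutdown -h now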

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] DRBD NetworkFailure

2013-04-23 Thread Greg Woods
Here's a new issue. We have had two outages, about 3 weeks apart, on one
of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this
was logged:

Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in
time.
Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary -> Unknown )
conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
Apr 19 17:02:22 vmn2 kernel: block drbd0: asender terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( NetworkFailure ->
Unconnected ) 
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver terminated
Apr 19 17:02:22 vmn2 kernel: block drbd0: Restarting receiver thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: receiver (re)started
Apr 19 17:02:22 vmn2 kernel: block drbd0: conn( Unconnected ->
WFConnection ) 
Apr 19 17:02:27 vmn2 kernel: block drbd1: PingAck did not arrive in
time.
Apr 19 17:02:27 vmn2 kernel: block drbd1: peer( Secondary -> Unknown )
conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
Apr 19 17:02:27 vmn2 kernel: block drbd1: new current UUID
37CF642BD875CB67:901912BD41972B81:FC8B5D00E5B5988E:FC8A5D00E5B5988F
Apr 19 17:02:27 vmn2 kernel: block drbd1: asender terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Terminating asender thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: Connection closed
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( NetworkFailure ->
Unconnected ) 
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver terminated
Apr 19 17:02:27 vmn2 kernel: block drbd1: Restarting receiver thread
Apr 19 17:02:27 vmn2 kernel: block drbd1: receiver (re)started
Apr 19 17:02:27 vmn2 kernel: block drbd1: conn( Unconnected ->
WFConnection ) 

This looks like a long-winded way of saying that the DRBD devices went
offline due to a network failure. One time this was logged on one node,
and the other time it was logged on the other node, so that would seem
to rule out any issue internal to one node (such as bad memory). In both
cases, nothing else is logged in any of the HA logs or
the /var/log/messages file. Obviously, the VMs stop providing services
and this is how the problem is noticed (DNS server not responding,
etc.). It doesn't appear that Pacemaker or Heartbeat ever even notices
that anything is wrong, since nothing is logged after the above until
the restart messages when I finally cycle the power via IPMI (which was
almost half an hour later). The two nodes are connected by a crossover
cable, and that is the link used for DRBD replication. So it seems as
though the only possibilities are a flaky NIC or a flaky cable, but in
that case, wouldn't I see some sort of hardware error logged? Anybody
else ever seen something like this?

Thanks,
--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] clean shutdown procedure?

2013-04-19 Thread Greg Woods
On Fri, 2013-04-19 at 16:43 +0200, Florian Crouzat wrote:
 crm configure property 

OK, thanks for the suggestions. What is the difference between
maintenance-mode=true and stop-all-resources=true? I tried the
latter first, and all the resources do stop, except that all the stonith
resources are still running. I'm just worried about the possibility of a
STONITH death match occurring at the next reboot; I'd rather see the
stonith resources stopped too. Or is there some reason why that would
not be desirable?
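
For context, both are set the same way as cluster properties; the practical
difference, as far as I understand it, is roughly this:

crm configure property maintenance-mode=true
# resources are left running but become unmanaged (and, depending on the
# version, recurring monitors are cancelled)

crm configure property stop-all-resources=true
# resources are stopped but remain managed and monitored; as noted above,
# the stonith resources may be left running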

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] drbd error message decoding help

2013-04-19 Thread Greg Woods
I realize that nobody can solve a problem based on a single log entry,
but I am trying to understand what happened with a cluster problem
today. A similar thing happened with this cluster about 3 weeks ago, so
this is one of those hard-to-solve intermittent issues. But it might
help me now if I understood better what this message actually means:

drbd[25887]:2013/04/19_17:02:07 DEBUG: vmgroup2:
Calling /usr/sbin/crm_master -Q -l reboot -v 1

It looks like a drbd process is calling a CRM process? Or is it the
other way around (which would make more sense?)

Thanks,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat IPv6addr OCF

2013-03-24 Thread Greg Woods
On Sun, 2013-03-24 at 01:36 -0700, tubaguy50035 wrote:

 params ipv6addr=2600:3c00::0034:c007 nic=eth0:3 \

Are you sure that's a valid IPV6 address? I get headaches every time I
look at these, but it seems a valid address is 8 groups, and you've got
5 there. Maybe you mean 2600:3c00::0034:c007?

--Greg

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Using a Ping Daemon (or Something Better) to Prevent Split Brain

2013-01-31 Thread Greg Woods
On Thu, 2013-01-31 at 02:09 +, Robinson, Eric wrote:

  the secondary should wait for a manual command to become primary. 

That can be accomplished with the meatware STONITH device. Requires a
command to be run to tell the wannabe primary that the secondary is
really dead (and, of course, you had better be sure that the secondary
is really dead before the command is run to avoid split brain).
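
A minimal sketch of the two pieces involved (node names invented): the
meatware stonith resource, plus the manual confirmation command a human runs
after verifying the peer really is down.

primitive st-meat stonith:meatware \
params hostlist="nodea nodeb"

# on the survivor, only after you have confirmed the other node is truly dead:
meatclient -c nodea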


 
 2. The secondary should refuse to become primary even if manually ordered to 
 do so if it cannot communcate with DataCenterC.

I don't know any way to do that exactly, but you might be able to use
order constraints to require some sort of ping-based resource to be
successfully started before the other resources can start.
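
A sketch of one common way to express that, using ocf:pacemaker:ping and a
location rule keyed on its attribute (names and addresses invented; a slightly
different mechanism than an order constraint, but the usual way to require
connectivity):

primitive p-ping ocf:pacemaker:ping \
params host_list="192.0.2.1 192.0.2.2" multiplier="1000" \
op monitor interval="15s"
clone cl-ping p-ping
# keep the service off any node that has lost contact with the ping targets
location loc-need-connectivity my-service \
rule -inf: not_defined pingd or pingd lte 0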

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how to diagnose stonith death match?

2013-01-09 Thread Greg Woods
On Thu, 2013-01-10 at 08:35 +1100, Andrew Beekhof wrote:
 On Wed, Jan 9, 2013 at 4:16 PM, Greg Woods wo...@ucar.edu wrote:

   I got the cluster running with xend by
   moving the heartbeat to a different interface.
 
  Having heartbeat start after the bridge is created _should_ also work.
 
 
  Obviously that can't work if xend is a cluster resource.
 
 Can you split up the networking part from the other pieces?

Not without hacking around in the Xen script files (which of course are
part of the distro's packages and my changes would get overwritten every
time I had to update). It is easier to just use an interface that
doesn't have a Xen bridge on it for heartbeat.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how to diagnose stonith death match?

2013-01-08 Thread Greg Woods
On Tue, 2013-01-08 at 09:18 +1100, Andrew Beekhof wrote:

  On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote:
 
  The problem is that either node can come up and run all the resources,
  but as soon as I bring the other node online, it briefly looks normal,
  but as soon as the stonith resource starts, the currently running node
  gets fenced and the new node takes over all the resources. Then the
  fenced node comes up, fences the other node and takes over, etc. Death
  match.

 Thats odd. Normally its a firewall issue.  Did you happen to choose a
 different port perhaps?

Close, but not quite. I did finally figure out what was going on, as the
death match started again as I was reconfiguring the cluster from
scratch, but this time I knew more about what was causing it. It started
as soon as I added xend as a resource. A little trial and error showed
that the heartbeat does not work if it is on an interface that also has
a Xen bridge attached to it. This is unexpected because all the other
kinds of networking on that interface work fine with the bridge active
(e.g. ssh connections, IPMI connections, etc.), only heartbeat is
affected. But it was absolutely reproducible. If I started xend by hand
instead of having it as a cluster resource, again I got a death match. A
careful reading of the logs did show that heartbeat was declaring the
other node dead. So for some reason, heartbeat communication was lost as
soon as the bridge was activated. I got the cluster running with xend by
moving the heartbeat to a different interface. This is less than ideal
because that interface is attached to a network that is also used for
different things and has other hosts attached to it, but since this is
only a test cluster, that's acceptable.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how to diagnose stonith death match?

2013-01-08 Thread Greg Woods
On Wed, 2013-01-09 at 13:15 +1100, Andrew Beekhof wrote:

 IIRC, part of the activation involves tearing down the normal
 interface and creating the bridge.
 At this point the device heartbeat was talking to is gone.

I hadn't thought of that, because afterwards, ethX looks exactly the
same as it did before, same IP and other settings. It just has xenbrX
attached to it. But I admit I don't know exactly what happens there.
 
  I got the cluster running with xend by
  moving the heartbeat to a different interface.
 
 Having heartbeat start after the bridge is created _should_ also work.


Obviously that can't work if xend is a cluster resource. I suppose xend
could be started outside the cluster before heartbeat, but then I don't
get to have it monitored by Pacemaker.

So this will be in the archives as a warning to people running clusters
for Xen virtual machines (or anything else that sets up bridged
networking). In my case, the only solution is to use an interface for
heartbeat that is not touched by Xen networking. I suppose people who
are using something other than bridged networking may not have this
issue either.
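
In ha.cf terms, that just means keeping the heartbeat links on interfaces
xend never touches (interface names illustrative):

# heartbeat only on NICs that will never be enslaved to a xenbr bridge
bcast eth4
#bcast eth0   # avoid: eth0 gets attached to xenbr0 once xend starts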

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] how to diagnose stonith death match?

2013-01-03 Thread Greg Woods
On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote:

 The problem is that either node can come up and run all the resources,
 but as soon as I bring the other node online, it briefly looks normal,
 but as soon as the stonith resource starts, the currently running node
 gets fenced and the new node takes over all the resources. Then the
 fenced node comes up, fences the other node and takes over, etc. Death
 match.


After spending way too much time on this, I finally gave up, completely
removed and reinstalled heartbeat and pacemaker, cleared out the
contents of /var/lib/heartbeat/crm, and reconfigured the cluster from
scratch. It is now working. I don't have all the resources in yet, but I
believe it will work properly when I am done.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Some novice questions?

2013-01-02 Thread Greg Woods
On Tue, 2013-01-01 at 14:58 +0330, Ali Masoudi wrote:

 Is it mandatory to use same ha.cf on both nodes? 

I don't think it is absolutely mandatory, but it is best practice.
Unless you really know what you are doing, you can run into difficulties
getting heartbeat to work properly if the ha.cf files are different.

 if names
 of network interfaces are differenet, what is best to do?

I have never run a cluster where this was so. My hardware is identical
on both nodes for all of my clusters, so the network interface names are
the same as well. I imagine you could get it to work if the ha.cf files
were the same except for the network interface names, but I haven't
tried this.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Some novice questions?

2012-12-31 Thread Greg Woods
On Mon, 2012-12-31 at 15:09 +0330, Ali Masoudi wrote:

 ucast eth3 192.168.50.17

If you are using ucast, then you need one line for each node's IP in the
ha.cf file. Either that or different ha.cf files on each node. What is
needed is the IP of the other node, but heartbeat is smart enough to
ignore ucast IP's that refer to the node it is running on, so the usual
practice is to include two ucast lines, one for each node's IP, so that
you can use the same ha.cf file on both nodes.
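
For example (the .17 address is the one from your snippet; the other is a
placeholder for the second node's address):

ucast eth3 192.168.50.16   # node A -- ignored when heartbeat runs on node A
ucast eth3 192.168.50.17   # node B -- ignored when heartbeat runs on node B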

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] how to diagnose stonith death match?

2012-12-28 Thread Greg Woods
I did some reconfiguration of the NICs and IP addresses on my 2-node
test cluster (running heartbeat and Pacemaker on CentOS 5, slightly old
versions but they have been working fine up to now on this and several
other clusters). I am sure that the NIC configuration is correct and
that the CIB has the correct modified data in it. Also the ha.cf file is
correct. (I even tried switching from bcast to ucast, but that did not
change the behavior).

The problem is that either node can come up and run all the resources,
but as soon as I bring the other node online, it briefly looks normal,
but as soon as the stonith resource starts, the currently running node
gets fenced and the new node takes over all the resources. Then the
fenced node comes up, fences the other node and takes over, etc. Death
match.

What I am looking for is just a hint about how to diagnose this. I have
tried looking in the log file, but as everyone knows, those logs are
incredibly voluminous, so I would like a hint about what to look for to
diagnose this.
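
As a starting point, something like the following narrows it down (the log
path depends on the ha.cf logfile setting, and the pattern is only a
heuristic):

grep -Ei 'stonith|fenc|dead|lost' /var/log/ha-log | less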

Thank you,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Custom resource agent script assistance

2011-12-01 Thread Greg Woods
On Thu, 2011-12-01 at 13:25 -0400, Chris Bowlby wrote:
 Hi Everyone, 
 
 I'm in the process of configuring a 2 node + DRBD enabled DHCP cluster

This doesn't really address your specific question, but I got dhcpd to
work by using the ocf:heartbeat:anything RA.

primitive dhcp ocf:heartbeat:anything \
params binfile="/usr/sbin/dhcpd" \
cmdline_options="-f -cf /vmgroup2/rep/dhcpd.conf -lf /vmgroup2/rep/dhcpd/dhcpd.leases" \
op monitor interval="10" timeout="50" depth="0" \
op start interval="0" timeout="90s" \
op stop interval="0" timeout="100s" \
meta target-role="Started"

The -cf and -lf arguments are just to ensure that the config file
and the leases file are located within a DRBD-replicated partition.

No doubt 10 people will surface to explain why this is a horrible way to
do it, but it does work.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Monitoring only across WAN

2011-06-20 Thread Greg Woods
On Mon, 2011-06-20 at 17:47 +0800, Emmanuel Noobadmin wrote:
 The objective is to achieve sub minute monitoring of services like
 httpd and exim/dovecot so that I can run a script to notify/SMS myself
 when one of the machines fails to respond. Right now I'm just running
a cron script every few minutes to ping the servers, but the
 problem is that I discovered that the server could respond to pings
 while services are dead to the world.

It sounds like HA may be the wrong tool for what you want. You might be
better off with some type of monitoring/notification tool such as
Nagios. Those tools can do more than just ping, they can connect to the
web server and verify that it is operating properly. While it might be
possible to make the cluster software work over a WAN, it was never
really designed to operate that way. Ideally you need more than one
connection between nodes and a way for one node to fence the other
(STONITH) in order for the cluster software to work properly.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] cat /dev/ttyS0

2011-05-23 Thread Greg Woods
On Mon, 2011-05-23 at 13:59 -0700, Hai Tao wrote:
 this might not be too close to HA, but I am not sure if someone has seen this
 before:

 I use a serial cable between two nodes, and I am testing the heartbeat with:

 server2$ cat < /dev/ttyS0
 server1$ echo hello > /dev/ttyS0

 instead of receiving hello on server2, I see some hashed code there.

 Does someone have an idea why I do not receive the hello in clear text?

This normally means there is something wrong with your tty settings (see
man stty). Either your settings at each end do not match, or the
settings you are using will not work with the cable you have. Or perhaps
the pinouts on the cable you are using are incorrect, but if you are
getting something across, it's more likely stty settings than cable
pinouts.
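
Something like this, run on both ends before re-testing, is usually enough
(speed and device are illustrative; the key point is that both sides must
match):

stty -F /dev/ttyS0 19200 cs8 -cstopb -parenb raw -echo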

I am not an expert on serial communications so this is about all the
help I can give, but I do know that seeing garbage on a serial tty
usually means the stty settings are wrong. I can also say that I have
used serial heartbeats in the past with success, but some things (like
certain USB-to-serial adapters), I could just never get to work. But
I've never had any trouble getting a serial cable between two on-board
serial ports to work.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Does heartbeat only use ping to check health of other server?

2011-04-04 Thread Greg Woods
On Mon, 2011-04-04 at 11:44 -0500, Neil Aggarwal wrote:

 From what I can figure out from the ha.cf file, heartbeat 
 uses ping to tell if the peer is up.

Not really. It uses special heartbeat packets to tell if the peer is up.
Ping is used to tell the difference between a dead peer and a bad NIC or
cable. If the NIC or cable is bad, the remote peer would not respond,
but also neither would any of the ping targets. The other node would see
its remote node dead, but the ping targets alive, so it would know to
take over resources. This is a crude method of avoiding split brain
compared to a real STONITH device, but it works surprisingly well in a
number of situations. We ran a number of critical services on
heartbeat-v1 clusters for years until we switched over to using
Pacemaker last year when it became obvious that no one is supporting
heartbeat-v1 configurations any more (we were dragged kicking and
screaming into the much more complicated but also much more flexible and
reliable world of Pacemaker).

 
 I want to switch the virtual IP if the ldirectord process 
 is not running or locked up.  That may happen even if the
 network card is ok.
 
 Is there a way to do that?

You don't say whether or not you are using Pacemaker. If you are, then
you can set up ldirectord as a Pacemaker resource and let Pacemaker
handle the monitoring. If you are not doing that, then you will need
something external to do the monitoring. That is basically a limitation
of heartbeat-v1 resources in general; the individual resources are not
monitored, so it is possible to get into a situation where one or more
resources are hung or crashed, but the heartbeat is still running so no
failover occurs. The only solutions to that involve some sort of
external monitor outside heartbeat (of which Pacemaker seems to be the
recommended one).
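
A sketch of the Pacemaker-resource approach (the config file path is
illustrative, and this assumes the ocf:heartbeat:ldirectord agent that ships
with the ldirectord package):

primitive p-ldirectord ocf:heartbeat:ldirectord \
params configfile="/etc/ha.d/ldirectord.cf" \
op monitor interval="20s" timeout="30s"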

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Does heartbeat only use ping to check health of other server?

2011-04-04 Thread Greg Woods
On Mon, 2011-04-04 at 13:38 -0500, Neil Aggarwal wrote:


 
 crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
 params ip=192.168.9.101 cidr_netmask=32 \
 op monitor interval=30s
 
 Does that mean heartbeat is being used to detect
 when to move the IP address to the standby server?

Heartbeat is only used to detect situations that require a complete
failover of all resources, i.e. to make sure the other node(s) is still
up and running the cluster software. It is Pacemaker's job to monitor
individual resources and move/restart them if necessary.

This may be a bit oversimplified and I'm sure the cluster guys will jump
in and correct this if I said something wrong.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problem with DRBD-based resource

2010-12-29 Thread Greg Woods
On Wed, 2010-12-29 at 12:56 +0100, Dejan Muhamedagic wrote:

  Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
  Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
  op=vmgroup1:0_stop_0 )
  Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
  rsc:vmgroup1:0:30: stop
  Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
  Performing key=50:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
  op=vmgroup2:0_stop_0 )
  Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
  rsc:vmgroup2:0:31: stop
  Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
  vmgroup1:0:stop process 8088 exited with return code 6.
  Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info:
  process_lrm_event: LRM operation vmgroup1:0_stop_0 (call=30, rc=6,
  cib-update=36, confirmed=true) not configured
  Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
  vmgroup2:0:stop process 8089 exited with return code 6.
 
 No messages from the drbd RA?

Nothing that I can see. It looks, however, like the same kind of error
is occurring with many or all of the resources. I have attached the
complete halog entries for the time period in question.

 This smells like a bug found in 1.0.9 which should've been
 fixed a while ago:
 
 http://developerbugs.linux-foundation.org/show_bug.cgi?id=2458

After reading that report, it doesn't look like the same problem to me,
but I will freely admit that the logs are hard for me to interpret.
There are entries like this showing what appear to be the correct
parameters:


Dec 28 09:19:13 vmserve.scd.ucar.edu lrmd: [7514]: notice: max_child_count (4) 
reached, postponing execution of operation monitor[10] on ocf::LVM::DRBDVG0 for 
client 7518, its parameters: volgrpname=[DRBDVG0] CRM_meta_timeout=[2] 
crm_feature_set=[3.0.1]  by 1000 ms

 
 If it's not a resource problem (i.e. drbd), please either reopen
 the bugzilla above or open a new one if it looks like a different
 problem. Don't forget to attach hb_report.

If you don't see anything obvious in the attached more complete log, I
will gladly do so. In the meantime, I may have to downgrade pacemaker so
that I can get my cluster back. We are running in non-HA mode right now.

--Greg


Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn 
directive: hacluster /usr/lib64/heartbeat/ipfail
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Pacemaker 
support: respawn
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: WARN: File 
/etc/ha.d//haresources exists.
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: WARN: This file is not 
used because crm is enabled
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn 
directive:  hacluster /usr/lib64/heartbeat/ccm
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn 
directive:  hacluster /usr/lib64/heartbeat/cib
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn 
directive: root /usr/lib64/heartbeat/lrmd -r
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn 
directive: root /usr/lib64/heartbeat/stonithd
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn 
directive:  hacluster /usr/lib64/heartbeat/attrd
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: respawn 
directive:  hacluster /usr/lib64/heartbeat/crmd
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: AUTH: i=1: key = 
0xd472250, auth=0x2abe40ad76f0, authname=sha1
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Pacemaker 
support: false
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: WARN: Logging daemon is 
disabled --enabling logging daemon is recommended
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: 
**
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Configuration 
validated. Starting heartbeat 3.0.2
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7139]: info: Heartbeat Hg 
Version: node: 7153d58dcb99ff4251449c5404754e26ee1af48e
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: heartbeat: 
version 3.0.2
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: Heartbeat 
generation: 1265221099
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP 
Broadcast heartbeat started on port 694 (694) interface eth0
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP 
Broadcast heartbeat closed on port 694 interface eth0 - Status: 1
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP 
Broadcast heartbeat started on port 694 (694) interface eth3
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: UDP 
Broadcast heartbeat closed on port 694 interface eth3 - Status: 1
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: ping group 
heartbeat started.
Dec 28 09:18:54 vmserve.scd.ucar.edu heartbeat: [7152]: info: glib: ping group 
heartbeat started.
Dec 28 

Re: [Linux-HA] problem with DRBD-based resource

2010-12-29 Thread Greg Woods


 On Tue, Dec 28, 2010 at 03:18:06PM -0700, Greg Woods wrote:
  I updated one of my clusters today, and among other things, I updated
  from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related
  or not.

Turns out it is. I downgraded the idle node to 1.0.9 and started
heartbeat there. I then have a working cluster. I then tried disabling
heartbeat on the 1.0.10 node, and got another mutual stonith which ends
up with all the resources on the 1.0.9 node. Then I downgraded the other
node to 1.0.9, and the cluster is now working again in HA mode.

I now feel more confident that this is a bug in 1.0.10, so I will file a
bugzilla.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] problem with DRBD-based resource

2010-12-28 Thread Greg Woods
I updated one of my clusters today, and among other things, I updated
from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related
or not.

The problem is that I cannot get the cluster to come up clean. Right now
all resources are running on one node and it is OK that way. As soon as
I start heartbeat on the second node, it goes into a stonith death
match. What I see is some failed actions involving trying to stop a DRBD
resource group. Here is a log snippet:

Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
op=vmgroup1:0_stop_0 )
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
rsc:vmgroup1:0:30: stop
Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
Performing key=50:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
op=vmgroup2:0_stop_0 )
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
rsc:vmgroup2:0:31: stop
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
vmgroup1:0:stop process 8088 exited with return code 6.
Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info:
process_lrm_event: LRM operation vmgroup1:0_stop_0 (call=30, rc=6,
cib-update=36, confirmed=true) not configured
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: WARN: Managed
vmgroup2:0:stop process 8089 exited with return code 6.


In this example, vmgroup1 and vmgroup2 are DRBD resources, then set
up as clones, which is the standard way to do this. Looks like this in
crm shell:

primitive vmgroup1 ocf:linbit:drbd \
params drbd_resource="vmgroup1" \
op monitor interval="59s" role="Master" timeout="30s" \
op monitor interval="60s" role="Slave" timeout="20s" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s"
[...]
ms ms-vmgroup1 vmgroup1 \
meta clone-max="2" notify="true" globally-unique="false" \
target-role="Started"

This has always worked fine until today. 

Any ideas what I can do to further debug this?

I am running on CentOS 5.5 using the clusterlabs repos.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] strange crm behavior

2010-12-21 Thread Greg Woods
On Tue, 2010-12-21 at 12:09 +0100, Dejan Muhamedagic wrote:
  Could it be that the status
 shown below is part of a node status which is not in the cluster
 any more? Or a node which is down?
 

No, that is not possible. This is a two-node cluster and both nodes have
been up for many days and are both currently running resources. There
have never been any other nodes that were part of this cluster.

--Greg


  [r...@vmserve sbin]# crm resource cleanup VM-paranfsvm
  Cleaning up VM-paranfsvm on vmserve2.scd.ucar.edu
  Cleaning up VM-paranfsvm on vmserve.scd.ucar.edu
  [r...@vmserve sbin]# cibadmin -Q | grep VM-paranfsvm
  <lrm_resource id="VM-paranfsvm" type="Xen" class="ocf" provider="heartbeat">
    <lrm_rsc_op id="VM-paranfsvm_monitor_0" operation="monitor"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="15:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:7;15:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="118" rc-code="7" op-status="0" interval="0"
      last-run="1292610207" last-rc-change="1292610207" exec-time="250"
      queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
  <lrm_resource id="VM-paranfsvm" type="Xen" class="ocf" provider="heartbeat">
    <lrm_rsc_op id="VM-paranfsvm_monitor_0" operation="monitor"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="17:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:7;17:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="67" rc-code="7" op-status="0" interval="0"
      last-run="1292610208" last-rc-change="1292610208" exec-time="240"
      queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
    <lrm_rsc_op id="VM-paranfsvm_start_0" operation="start"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="146:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:0;146:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="68" rc-code="0" op-status="0" interval="0"
      last-run="1292610209" last-rc-change="1292610209" exec-time="2540"
      queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
    <lrm_rsc_op id="VM-paranfsvm_monitor_1" operation="monitor"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="147:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:0;147:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="69" rc-code="0" op-status="0" interval="1"
      last-run="1292610213" last-rc-change="1292610213" exec-time="290"
      queue-time="0" op-digest="e507fbd4a0eb54917c1cb1e51bafbd7f"/>
    <lrm_rsc_op id="VM-paranfsvm_stop_0" operation="stop"
      crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
      transition-key="145:4558:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:0;145:4558:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="70" rc-code="0" op-status="0" interval="0"
      last-run="1292610224" last-rc-change="1292610224" exec-time="5690"
      queue-time="30" op-digest="d84dd793335cf339b4757a9041f005ac"/>


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] strange crm behavior

2010-12-20 Thread Greg Woods
On Mon, 2010-12-20 at 12:40 +0100, Dejan Muhamedagic wrote:

 
 That's strange. resource cleanup should definitely remove the
 LRM (status) part. Can you please try again and then do:
 
 # cibadmin -Q | grep VM-paranfsvm

It seems like it is not removing status info for old removed resources:

[r...@vmserve sbin]# crm resource cleanup VM-paranfsvm
Cleaning up VM-paranfsvm on vmserve2.scd.ucar.edu
Cleaning up VM-paranfsvm on vmserve.scd.ucar.edu
[r...@vmserve sbin]# cibadmin -Q | grep VM-paranfsvm
  <lrm_resource id="VM-paranfsvm" type="Xen" class="ocf"
      provider="heartbeat">
    <lrm_rsc_op id="VM-paranfsvm_monitor_0" operation="monitor"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="15:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:7;15:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="118" rc-code="7" op-status="0" interval="0"
      last-run="1292610207" last-rc-change="1292610207" exec-time="250"
      queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
  <lrm_resource id="VM-paranfsvm" type="Xen" class="ocf"
      provider="heartbeat">
    <lrm_rsc_op id="VM-paranfsvm_monitor_0" operation="monitor"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="17:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:7;17:4557:7:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="67" rc-code="7" op-status="0" interval="0"
      last-run="1292610208" last-rc-change="1292610208" exec-time="240"
      queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
    <lrm_rsc_op id="VM-paranfsvm_start_0" operation="start"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="146:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:0;146:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="68" rc-code="0" op-status="0" interval="0"
      last-run="1292610209" last-rc-change="1292610209" exec-time="2540"
      queue-time="0" op-digest="d84dd793335cf339b4757a9041f005ac"/>
    <lrm_rsc_op id="VM-paranfsvm_monitor_1" operation="monitor"
      crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
      transition-key="147:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:0;147:4557:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="69" rc-code="0" op-status="0" interval="1"
      last-run="1292610213" last-rc-change="1292610213" exec-time="290"
      queue-time="0" op-digest="e507fbd4a0eb54917c1cb1e51bafbd7f"/>
    <lrm_rsc_op id="VM-paranfsvm_stop_0" operation="stop"
      crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
      transition-key="145:4558:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      transition-magic="0:0;145:4558:0:eabcd13b-33aa-4216-a517-bf5ece092559"
      call-id="70" rc-code="0" op-status="0" interval="0"
      last-run="1292610224" last-rc-change="1292610224" exec-time="5690"
      queue-time="30" op-digest="d84dd793335cf339b4757a9041f005ac"/>
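
(If cleanup keeps leaving these behind, a heavier-handed workaround, untested here and using the resource id from the output above, would be to delete the stale status entries directly:

cibadmin --delete --xml-text '<lrm_resource id="VM-paranfsvm"/>'

and repeat until cibadmin -Q | grep VM-paranfsvm comes back empty.)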


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Multiple stonith and Heartbeat 2.1.4

2010-11-18 Thread Greg Woods
On Thu, 2010-11-18 at 14:46 +0100, Sébastien Prud'homme wrote:


 I'm using meatware as a second
 stonith resource 

I'm doing this and it works fine.

 Unfortunately after several tests, I didn't find a way to make it
 work: only the first stonith resource is used (and fails), the
 cluster enters a loop (trying to use only the first stonith
 resource) and no resource migration is done.

You did run the meatclient, right? What was the command you used and the
output of it?
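
For reference, on my clusters the confirmation step is nothing more than running, on the surviving node, something like (the node name is a placeholder):

meatclient -c name-of-dead-node

which tells the waiting meatware stonith resource that the node really is down.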

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] debugging resource configuration

2010-11-03 Thread Greg Woods
On Wed, 2010-11-03 at 11:13 +0100, Dejan Muhamedagic wrote:

  ERROR with rpm_check_debug vs depsolve:
  heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64
  Complete!
  (1, [u'Please report this error in
  https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise
  %20Linux%205component=yum'])
 
 Hardly an RPM expert here, but didn't it ask you to report a
 problem with yum? 

I suppose I could go through the motions of doing this, but Red Hat will
likely (and correctly) point out that clusterlabs is a third-party repo.
Since the RHEL package (or at least the downstream CentOS version of it)
will install and run just fine on systems not using the clusterlabs
repo, this really doesn't seem to be a Red Hat or CentOS problem (at
least from their point of view).

In any case I have worked around the problem by using a vanilla CentOS
virtual machine to run ldirectord instead of trying to do it on my
Pacemaker host OS.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debugging resource configuration

2010-11-02 Thread Greg Woods
On Tue, 2010-11-02 at 11:11 +0100, Dejan Muhamedagic wrote:

 If you're using resource-agents, the package should be named
 ldirectord not heartbeat-ldirectord. The two packages should also
 have the same release numbers, probably something like 1.0.3-x.

I figured as much. But there appears to be a problem with the ldirectord
package from clusterlabs, as explained in an earlier message from
Masashi Yamaguchi yamag...@gmail.com:


 think ldirectord rpm package's spec for RedHat/CentOS is
 inconsistent.
 
 $ rpm -qp --provides ldirectord-1.0.3-2.el5.x86_64.rpm
 config(ldirectord) = 1.0.3-2.el5
 heartbeat-ldirectord
 ldirectord = 1.0.3-2.el5
 $ rpm -qp --conflicts ldirectord-1.0.3-2.el5.x86_64.rpm
 heartbeat-ldirectord
 $
 
 ldirectord package PROVIDES heartbeat-ldirectord and
 CONFLICTS with heartbeat-ldirectord.
 ldirectord package's spec has a self-conflict.
 
 This is a patch for the problem.
 --- resource-agents.spec
 +++ resource-agents.spec
 @@ -71,7 +71,6 @@
  Requires:   %{SSLeay} perl-libwww-perl ipvsadm
  Provides:  heartbeat-ldirectord
  Obsoletes: heartbeat-ldirectord
 -Conflicts: heartbeat-ldirectord
  Requires:  perl-MailTools
  %if 0%{?suse_version}
  Requires:   logrotate
 
 I installed the modified ldirectord package successfully.
 

--Greg




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debugging resource configuration

2010-11-02 Thread Greg Woods
On Tue, 2010-11-02 at 22:24 +0100, Lars Ellenberg wrote:
  
   ldirectord package PROVIDES heartbeat-ldirectord and
   CONFLICTS with heartbeat-ldirectord.
   ldirectord package's spec has a self-conflict.
   
   This is a patch for the problem.
   --- resource-agents.spec
   +++ resource-agents.spec
   @@ -71,7 +71,6 @@
Requires:   %{SSLeay} perl-libwww-perl ipvsadm
Provides:  heartbeat-ldirectord
Obsoletes: heartbeat-ldirectord
   -Conflicts: heartbeat-ldirectord
Requires:  perl-MailTools
%if 0%{?suse_version}
Requires:   logrotate
 
 That's incorrect, to the best of my knowledge.
 Though I'm certainly not an RPM wizard.
 
 That seems to be standard procedure for package name changes.
 
 package used to be named some-package,
 package is renamed to other-package.
 other-package now provides, obsoletes, and conflicts with some-package.
 
 If you have a good pointer to some rpm packaging doc saying otherwise,
 please let us know.
 


I do not claim to be an RPM expert either, I was only repeating what
someone else said. According to his report, modifications were needed to
the ldirectord package in order for it to install.

What I do know is that I cannot install it on my CentOS 5 system even
though I have made sure that heartbeat-ldirectord is not already
installed. Here is the result:

[r...@vmserve2 woods]# yum install ldirectord.x86_64
Loaded plugins: dellsysid, fastestmirror
Loading mirror speeds from cached hostfile
 * addons: mirror.ubiquityservers.com
 * extras: mirrors.versaweb.com
Setting up Install Process
Resolving Dependencies
-- Running transaction check
--- Package ldirectord.x86_64 0:1.0.3-2.6.el5 set to be updated
-- Finished Dependency Resolution

Dependencies Resolved


 Package           Arch        Version            Repository      Size

Installing:
 ldirectord        x86_64      1.0.3-2.6.el5      clusterlabs     55 k
Transaction Summary

Install   1 Package(s)
Upgrade   0 Package(s)

Total download size: 55 k
Is this ok [y/N]: y
Downloading Packages:
ldirectord-1.0.3-2.6.el5.x86_64.rpm  |  55 kB
00:00 
Running rpm_check_debug
ERROR with rpm_check_debug vs depsolve:
heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64
Complete!
(1, [u'Please report this error in
https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise
%20Linux%205component=yum'])
[r...@vmserve2 woods]# rpm -q heartbeat-ldirectord
package heartbeat-ldirectord is not installed

I can install heartbeat-ldirectord, but unsurprisingly it does not work
properly with Pacemaker.

For now I gave up installing this on the Pacemaker box, and instead
created a virtual machine, installed heartbeat-ldirectord on it, and
wrote myself a crude monitoring script. This setup is working.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debugging resource configuration

2010-10-29 Thread Greg Woods
On Thu, 2010-10-28 at 18:38 -0600, Eric Schoeller wrote:
 Just a shot in the dark here kind of ... but I know that when I had this 
 type of problem with a stonith device it was timeout related. You could 
 try boosting your timeouts all around, or even check what
 
 # time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start
 
 reports back.

[r...@vmx1 log]# time /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start

real0m0.261s
user0m0.188s
sys 0m0.068s

(after which it is working fine; I can connect to the virtual service
and get properly redirected to the real server).

I am convinced now that either there is a bug in the resource agent so
that the monitor process just doesn't work right, or there is something
obvious and stupid in my configuration that I just don't see, or else
the ldirectord script that I have (which came from the CentOS
heartbeat-ldirectord package) is incompatible with what the resource
agent is expecting.
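
One way to narrow this down is to run the resource agent by hand with the same OCF environment and look at the return codes; a rough sketch (I am assuming the parameter is called configfile, which crm ra info ocf:heartbeat:ldirectord should confirm):

# exercise the RA directly; for monitor, 0 means running and 7 means not running
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_configfile=/etc/ha.d/ldirectord.cf
/usr/lib/ocf/resource.d/heartbeat/ldirectord start;   echo "start rc=$?"
/usr/lib/ocf/resource.d/heartbeat/ldirectord monitor; echo "monitor rc=$?"
/usr/lib/ocf/resource.d/heartbeat/ldirectord stop;    echo "stop rc=$?"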


 If timeouts aren't it, I would start breaking out parts of the cluster 
 config and trying it again until it works

I still haven't been able to make it work, but I have eliminated a
number of variables. I got rid of all the IPAddr resources, order
statements, and colocation statements. All that is there now that is
relevant to ldirectord is:

primitive ldirectord ocf:heartbeat:ldirectord \
op start interval=20s timeout=15s \
op stop interval=20s timeout=15s \
op monitor interval=20s timeout=20s 

(I have actually tried different interval and timeout numbers but the
result is always the same).

That's it. Then I configured the eth1:0 interface manually to correspond
with the IP address of the virtual server configured in ldirectord.cf,
and ran crm resource ldirectord start. The result is the same start,
stop, FAILED scenario repeated. The logs appear to show that it is
running the status check every 2 seconds or so, despite my interval and
timeout settings:

[Fri Oct 29 10:11:06 2010|ldirectord.cf|19214] Starting Linux Director
v1.186-ha-2.1.3 as daemon
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Added virtual server:
128.117.64.127:25
 [...]

[Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Quiescent real server:
128.117.64.123:25 (128.117.64.127:25
) (Weight set to 0)

[...]
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19216] Restored real server:
128.117.64.123:25 (128.117.64.127:25)
 (Weight set to 1)

(there are similar pairs of entries for all the declared real servers)

So far so good, now comes the problem:

[Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status 
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] ldirectord
for /etc/ha.d/ldirectord.cf is running with pid: 19216
[Fri Oct 29 10:11:06 2010|ldirectord.cf|19221] Exiting from ldirectord
status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status 
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] ldirectord
for /etc/ha.d/ldirectord.cf is running with pid: 19216
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19405] Exiting from ldirectord
status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf status 
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] ldirectord
for /etc/ha.d/ldirectord.cf is running with pid: 19216
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19410] Exiting from ldirectord
status
[Fri Oct 29 10:11:08 2010|ldirectord.cf|19416] Invoking ldirectord
invoked as: /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf stop 

The status check should have succeeded, but the monitor process thinks
it failed. Also as can be seen, the status check is repeated only 2
seconds later. The corresponding log for lrmd shows:

Oct 29 10:11:05 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5526:
start
Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: RA output:
(ldirectord:start:stdout) /usr/sbin/ldirector
d /etc/ha.d/ldirectord.cf start
Oct 29 10:11:06 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:start process 19203 exited with retur
n code 0.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5527:
start
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
operation start[5527] on ocf::ldirector
d::ldirectord for client 4845, its parameters: CRM_meta_interval=[2]
CRM_meta_timeout=[15000] crm_feature_set=[3.0.1] CRM_meta_name=[start]
for rsc is already running.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2916:
postponing all ops on resource ldirectord by 1000 ms
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
operation start[5527] on ocf::ldirectord::ldirectord for client 4845,
its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[15000]
crm_feature_set=[3.0.1] CRM_meta_name=[start]  for rsc is already
running.
Oct 29 10:11:07 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2910:
operations on 

Re: [Linux-HA] ldirectord on CentOS 5

2010-10-29 Thread Greg Woods
On Fri, 2010-10-29 at 12:09 +0900, Masashi Yamaguchi wrote:

 
 I think ldirectord rpm package's spec for RedHat/CentOS is inconsistent.
 
 $ rpm -qp --provides ldirectord-1.0.3-2.el5.x86_64.rpm
 config(ldirectord) = 1.0.3-2.el5
 heartbeat-ldirectord
 ldirectord = 1.0.3-2.el5
 $ rpm -qp --conflicts ldirectord-1.0.3-2.el5.x86_64.rpm
 heartbeat-ldirectord
 $
 
 ldirectord package PROVIDES heartbeat-ldirectord and
 CONFLICTS with heartbeat-ldirectord.
 ldirectord package's spec has a self-conflict.
 
 This is a patch for the problem.
 --- resource-agents.spec
 +++ resource-agents.spec

I don't quite get this. Is your patch for the resource-agents package or
the ldirectord package? I presume the idea is that you get the src rpm, extract it,
apply the patch, and rebuild the RPM? (I haven't been able to find the
src rpm for ldirectord).
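
Presumably the rebuild would then be something along these lines (the file names are guesses, since I have not found the src rpm yet):

# on a CentOS 5 box with rpm-build installed
rpm -ivh resource-agents-1.0.3-2.6.el5.src.rpm
cd /usr/src/redhat/SPECS
patch -p0 < remove-selfconflict.patch      # the spec patch quoted above
rpmbuild -bb resource-agents.spec          # packages land under /usr/src/redhat/RPMS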

If I understand this correctly, it looks like a bug that should be fixed
in the clusterlabs repo. Do they have a place to officially report bugs?

I did try extracting the ldirectord script from the clusterlabs
ldirectord package, and it segfaults, so I suspect I really have to find
a way to install the entire package in order to use that script and get
the heartbeat/pacemaker monitoring to work properly.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] ldirectord on CentOS 5

2010-10-28 Thread Greg Woods
I currently have an old heartbeat v1 cluster that I am moving to a newer
Pacemaker/heartbeat v3 cluster. That is, I am moving the functionality
of the old cluster to the new one so that the old one can be phased out.
The new cluster is running all the latest stuff from the clusterlabs
repo under CentOS 5.5.

One thing the old one does is run Linux Virtual Server and ipvsadm to
farm out incoming SMTP connections to multiple mail processing nodes
(virus scanning, spamassassin scanning, alias lookup, etc.). I would
like to have the new cluster do this.

From what I have read, it appears that the right way to do this is to
install ldirectord and set up an ldirectord resource in Pacemaker. The
problem is that I can't get ldirectord to install. There is an
ldirectord package in the clusterlabs repo, and a heartbeat-ldirectord
package in the CentOS-extras repo, and they conflict. Neither one is
installed now but I still get this error when I try to install
ldirectord:

ERROR with rpm_check_debug vs depsolve:
heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64
Complete!
(1, [u'Please report this error in
https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise
%20Linux%205component=yum'])

The same thing happens if I disable the extras repo, and even if I do
yum clean all first. If instead I try to install heartbeat-ldirectord
and disable the clusterlabs repo (which might result in a package that
doesn't work right in any event), I get a different error:

Transaction Check Error:
  file /usr/lib/ocf/resource.d/heartbeat/ldirectord from install of
heartbeat-ldirectord-2.1.3-3.el5.centos.x86_64 conflicts with file from
package resource-agents-1.0.3-2.6.el5.x86_64


Is going to the source the only way to get ldirectord to install on this
system, or has someone else seen this before and know of a workaround?

Thanks,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ldirectord on CentOS 5

2010-10-28 Thread Greg Woods

  The same thing happens if I disable the extras repo, and even if I do
  yum clean all first. If instead I try to install heartbeat-ldirectord
  and disable the clusterlabs repo (which might result in a package that
  doesn't work right in any event), I get a different error:
  
  Transaction Check Error:
file /usr/lib/ocf/resource.d/heartbeat/ldirectord from install of
  heartbeat-ldirectord-2.1.3-3.el5.centos.x86_64 conflicts with file from
  package resource-agents-1.0.3-2.6.el5.x86_64
 
 Try to get rid of the file if it is still there. Try it again afterwards.

I am a little confused. Can I actually install the CentOS extras
heartbeat-ldirectord package from CentOS and expect it to work with
all the clusterlabs stuff? The clusterlabs repo also has an ldirectord
package.

The situation I have now is that the resource agent script is present
(it's in the resource-agents package), but the actual ldirectord script
is not. So I actually copied the /usr/sbin/ldirectord binary from
another CentOS 5 machine that doesn't have clusterlabs but does have
heartbeat-ldirectord, and then tried to configure an
ocf:heartbeat:ldirectord resource, but when I did the commit, I got this
error reported by crm_mon:

Failed actions:
ldirectord_monitor_0 (node=vmx2.ucar.edu, call=137, rc=5,
status=complete): not installed
ldirectord_monitor_0 (node=vmx1.ucar.edu, call=79, rc=5,
status=complete): not installed

Seems like there is something in the package besides just the ldirectord
script that is needed.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ldirectord on CentOS 5

2010-10-28 Thread Greg Woods
On Thu, 2010-10-28 at 14:52 -0600, Greg Woods wrote:

 I am a little confused.

I was actually more confused than I thought. When I got this error:


 Failed actions:
 ldirectord_monitor_0 (node=vmx2.ucar.edu, call=137, rc=5,
 status=complete): not installed
 ldirectord_monitor_0 (node=vmx1.ucar.edu, call=79, rc=5,
 status=complete): not installed

I carefully inspected the logs and determined that what this really
meant was that ldirectord couldn't find the config file (it was in a
different place than it was expecting to find it). So I was actually
able to copy over the ldirectord script from another system and get an
ldirectord resource to start, once I put the config file in the correct
place and created an IPAddr resource for the virtual service address.
Running ipvsadm shows that it is working as expected (the virtual and
real servers are correctly reported) and ifconfig shows that the
virtual service address is present. But when I try to connect to the
virtual service, I get connection refused although I can connect to
the real servers just fine. This is a problem that is most likely
outside the HA software and hopefully I will be able to solve it (I did
check firewall rules first).

I still would like to find a solution to the original question though
(how to install an ldirectord package), just for the purposes of making
it easier to keep things updated going forward.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] debugging resource configuration

2010-10-28 Thread Greg Woods
This is a continuation of trying to get ldirectord working under
pacemaker. I have a working installation of ldirectord. I know this
because if I manually configure the eth0:0 pseudo-interface with the
virtual server address, and manually start ldirectord with

# /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf start

...then everything works. I can connect to the virtual service address
and port, and I get properly redirected to one of the real servers.
ipvsadm shows normal output. All looks good.

However, if I try to start the ldirectord resource, it starts, then
fails, then starts, then fails, etc. This will continue until I issue a
resource ldirectord stop command in the CRM shell. 

So it has to be something with how I configured it, but I'm damned if I
can figure it out. Here is what I have that involves this resource:

primitive ldirectord ocf:heartbeat:ldirectord \
op start interval=20 timeout=15 \
op stop interval=20 timeout=15 \
op monitor interval=20 timeout=20 \
colocation vdir-ipi-with-ldirectord inf: vdir-ipi ldirectord
order vdir-ipi-before-ldirectord inf: vdir-ipi ldirectord

The vdir-ipi is an IPAddr resource that will start fine and results in
the eth0:0 alias interface being configured and brought up.
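
For completeness, a minimal sketch of such an IPAddr primitive, with illustrative address values:

primitive vdir-ipi ocf:heartbeat:IPaddr \
        params ip=128.117.64.127 cidr_netmask=24 nic=eth0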

When I issue a resource start ldirectord command from the crm shell,
what I get from lrmd is repeats of this sequence:

Oct 28 18:12:24 vmx1.ucar.edu lrmd: [4842]: info: rsc:vdir-ipi:5464:
start
Oct 28 18:12:24 vmx1.ucar.edu lrmd: [4842]: info: Managed vdir-ipi:start
process 4923 exited with return code 0.

Oct 28 18:12:25 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5466:
start
Oct 28 18:12:25 vmx1.ucar.edu lrmd: [4842]: info: RA output:
(ldirectord:start:stdout) /usr/sbin/ldirectord /etc/ha.d/ldirectord.cf
start
Oct 28 18:12:26 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:start process 5103 exited with return code 0.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5467:
start
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
operation start[5467] on ocf::ldirectord::ldirectord for client 4845,
its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[15000]
crm_feature_set=[3.0.1] CRM_meta_name=[start]  for rsc is already
running.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2916:
postponing all ops on resource ldirectord by 1000 ms
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2906:
operation start[5467] on ocf::ldirectord::ldirectord for client 4845,
its parameters: CRM_meta_interval=[2] CRM_meta_timeout=[15000]
crm_feature_set=[3.0.1] CRM_meta_name=[start]  for rsc is already
running.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: perform_op:2910:
operations on resource ldirectord already delayed
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:start process 5221 exited with return code 0.
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5468:
stop
Oct 28 18:12:27 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:stop process 5226 exited with return code 0.
Oct 28 18:12:28 vmx1.ucar.edu lrmd: [4842]: WARN: Managed
ldirectord:monitor process 5265 exited with return code 7.
Oct 28 18:12:29 vmx1.ucar.edu lrmd: [4842]: info: cancel_op: operation
monitor[5469] on ocf::ldirectord::ldirectord for client 4845, its
parameters: CRM_meta_interval=[2] CRM_meta_timeout=[2]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
Oct 28 18:12:29 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5470:
stop
Oct 28 18:12:29 vmx1.ucar.edu lrmd: [4842]: info: Managed
ldirectord:stop process 5296 exited with return code 0.

And then it repeats:

Oct 28 18:12:31 vmx1.ucar.edu lrmd: [4842]: info: rsc:ldirectord:5471:
start

etc.

How can I figure out what I have done wrong here?

Thanks,
--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat with postgresql

2010-10-22 Thread Greg Woods
On Fri, 2010-10-22 at 18:32 +0200, Andrew Beekhof wrote:
 if you're just using v1 - thats not a cluster,
 thats a prayer.

Then God must answer my prayers, because I have been using some simple
heartbeat v1/DRBD clusters for YEARS, for critical services like DNS.
They have worked flawlessly and always failed over properly when
individual servers developed problems or had to be taken offline for
maintenance. I work in an environment where four or five nines are not
required.

As I see it, heartbeat v1 is only suitable in situations where you have
a small number of resources, and the resource start order can be defined
strictly linearly. In those cases, it works quite well, and I have a
number of cases like that.

This is my last contribution to this thread. It is obvious that some
people have already decided that heartbeat v1 can't ever work. I am
obviously not going to change anyone's mind, and the obvious fact is
that you can no longer get any support for running heartbeat v1. I
continue to use v1 on some clusters because it is already configured and
working and has done what I needed it to do. My newer clusters are more
complicated and therefore do use pacemaker. My old clusters are due to
be replaced with virtual machines that run on the new clusters, so I
expect in a few months I will have completely phased out v1 anyway.

--Greg




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat with postgresql

2010-10-20 Thread Greg Woods
On Wed, 2010-10-20 at 08:13 +0200, Andrew Beekhof wrote:

  Um, maybe because heartbeat v1 has a much much much much less steep
  learning curve?
 
 I dispute that:
 

 http://theclusterguy.clusterlabs.org/post/178680309/configuring-heartbeat-v1-was-so-simple


This addresses the fact that Pacemaker has many features that heartbeat
v1 lacks. That is not in dispute, but it completely sidesteps the point
that heartbeat v1 is sufficient for many uses and much easier to get
working. I have not said that heartbeat v1 is better than pacemaker,
only that it is easier to get working. The question was asked why would
anyone want to use heartbeat v1. Here is one valid answer to that
question. This point has been made on this list before by myself and
others, and yet the question why would anyone want to use heartbeat v1
continues to be asked. I understand that nobody has any interest in
developing heartbeat v1 any more. I accept this, I have moved on to v3
and Pacemaker. But that does not invalidate the answer to the original
question.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat with postgresql

2010-10-19 Thread Greg Woods
On Tue, 2010-10-19 at 10:01 -0600, Serge Dubrouski wrote:
 Any particular reason for using Heartbeat v1 instead of CRM/Pacemaker?

Um, maybe because heartbeat v1 has a much much much much less steep
learning curve? If you have a simple two-node cluster where one node is
just a hot spare, it is way way way way easier to get it working with
heartbeat v1.

The first time I ever set up a high availability cluster, going in
knowing nothing at all about it, I had a heartbeat v1 cluster working in
a couple of days. Already having had considerable heartbeat v1
experience, it took me a couple of months to get a cluster working under
heartbeat v3/Pacemaker. The pace of development is also high enough that
the documentation often lags behind reality. That is not a criticism; I
know how hard it is to keep the documentation up to date (I am already
in that mode now with these new clusters; nobody else knows how they
work so I can't even take a vacation now that I have some production
services running on them, until I finish writing up some administration
procedures).

Yes, no doubt a Pacemaker cluster is far more flexible, but when one
doesn't need all that flexibility and just wants a simple two-node HA
cluster, the simplicity of heartbeat v1 is very attractive.

This shouldn't be a big a mystery as it seems to be. Face up to it:
learning and properly configuring Pacemaker is HARD, even for
experienced sysadmins. And unless you need the additional flexibility
that Pacemaker offers, it seems like a lot of extra effort.

Will I use Pacemaker all the time in the future? Yes, because I have
already put in the effort to learn and configure it. Setting up a new
cluster, where I had an existing one to use as a template, took less
than a week. But that first time, it was difficult, time consuming, and
often frustrating. 

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Standby Node Refuses to Take Over

2010-09-27 Thread Greg Woods
On Mon, 2010-09-27 at 09:43 -0700, Robinson, Eric wrote:

 
 I went so far as to turn off the primary, but the standby still never
 took over. 

Do you have STONITH configured? 

I have run into this too. The primary will not take over unless it is
told somehow that the secondary is really and truly dead. If you have a
real STONITH device such as IPMI, it will cause the secondary to
forcibly power off the primary, providing the guarantee it needs to take
over. On my test cluster where I don't have a working STONITH device
yet, I use the meatware pseudo-device, which allows me to run a
program on one node to inform it that the other node is really dead and
that it is OK to take over. 

My old heartbeat v1 clusters used to work just fine without STONITH.
DRBD split brain would occur every once in a while if both nodes lost
power at the same time, but I could live with this. I wouldn't be
surprised if the newer Pacemaker clusters pretty much require STONITH in
order to work. Maybe someone in the know can confirm or deny this?

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Standby Node Refuses to Take Over

2010-09-27 Thread Greg Woods
On Mon, 2010-09-27 at 12:16 -0700, Robinson, Eric wrote:
 Not sure if you noticed in my previous message that I did physically
 power down the primary but the standby refused to take any action. 

Yes, I did notice that. My point is that I have noted on my clusters
that simply powering it down (i.e. having it suddenly go away) may not
be enough. That requires the standby to simply assume that the primary
has gone away, and that it's not just a cable or NIC failure. STONITH is a method
of *assuring* that the other node has gone away. It is designed to
prevent both nodes from trying to run the same resources, which can have
disastrous consequences. 

As I noted, I am not certain whether or not using STONITH is absolutely
required now, but I have observed the same symptoms as you, and I ended
up having to configure STONITH in order to get failovers to work
properly.

Usually though, if I explicitly set one node to standby, the other one
will take over, because they can exchange messages that will convince
the remaining node that the standby node will not be running any
resources. 

So I really don't know if STONITH is your problem or would fix your
problem. I only note that I have seen the same symptoms and that was how
I fixed it for my clusters.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] node standby attribute and crm (SOLVED Partially)

2010-09-24 Thread Greg Woods
On Fri, 2010-09-24 at 11:34 -0600, Greg Woods wrote:

 # crm node show
 vmserve2.scd.ucar.edu(16fde08d-b4b6-4550-adfb-b3aab83f706f): normal
 standby: off
 vmserve.scd.ucar.edu(6f5ced83-a790-4519-8449-3d4cf43275b0): normal
 standby: off
 
 On the second cluster:
 
 # crm node show
 vmx1.ucar.edu(62cf0a44-5d0f-475e-a0ac-689537f98f58): normal
 vmx2.ucar.edu(8ad9076e-c571-499b-91e9-4d513fd5be61): normal

This difference can be corrected by running:

# crm node attribute vmx1.ucar.edu set standby off
# crm node attribute vmx2.ucar.edu set standby off

But I don't recall having to do this before, so this does not explain
why the difference occurred in the first place. I also don't know if
this change will last across a reboot, but since it's part of the CIB,
hopefully it will.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Adding DHCPD and NAMED as resources

2010-09-09 Thread Greg Woods
On Thu, 2010-09-09 at 16:35 +0100, Daniel Machado Grilo wrote:
 Another way to do this is if you choose LSB instead of OCF category 
 primitives.
 That way you just select the init script from your init.d and thats it.

You do need to ensure that your init script is LSB compliant. This
includes but is not limited to returning success when a stop is
attempted when it is already stopped. Some init scripts I have seen do
not do this right, which means this approach may or may not work correctly.

As an example, on CentOS 5.5 on a system that is running neither service
right now:

[r...@vmserve woods]# ps ax | fgrep named
10329 pts/7S+ 0:00 fgrep named
[r...@vmserve woods]# ps ax | fgrep dhcpd
15451 pts/7S+ 0:00 fgrep dhcpd
[r...@vmserve woods]# service named stop
Stopping named:[  OK  ]
[r...@vmserve woods]# echo $?
0
[r...@vmserve woods]# service dhcpd stop
[r...@vmserve woods]# echo $?
7
[r...@vmserve woods]#

The named script does the right thing but the dhcpd script does not.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Adding DHCPD and NAMED as resources

2010-09-08 Thread Greg Woods
On Wed, 2010-09-08 at 14:18 -0500, Bradley Leduc wrote:
 Am trying to add NAMED and DHCPD services as a resource on
 heartbeat-3.0.1-1.el5 cluster with no luck, I was wondering if anyone would
know of an easy way to do this. Any help would be great.

Are you running pacemaker or just a heartbeat v1-style config? I've done
it both ways. For v1 all I did was add dhcpd to haresources. For
pacemaker, I use the ocf:heartbeat:anything resource since I couldn't
find one specific to named or dhcpd anywhere. So I have config lines
like this:

primitive dhcp ocf:heartbeat:anything \
params binfile=/usr/sbin/dhcpd cmdline_options=-f \
op monitor interval=10 timeout=50 depth=0 \
op start interval=0 timeout=90s \
op stop interval=0 timeout=100s \
meta target-role=Started

named works similarly. 
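
A minimal sketch of the named version, assuming the same ocf:heartbeat:anything parameters carry over (named -f keeps it in the foreground, matching the -f used for dhcpd above):

primitive named ocf:heartbeat:anything \
        params binfile=/usr/sbin/named cmdline_options=-f \
        op monitor interval=10 timeout=50 depth=0 \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s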

For v1, you may need to  create a resource.d script that properly
returns 0 if you try to stop a daemon that is already stopped; the
standard init.d startup scripts don't always do this.
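
A minimal sketch of such a wrapper, using dhcpd as the example (the paths are placeholders for whatever daemon you are wrapping):

#!/bin/sh
# /etc/ha.d/resource.d/dhcpd: make "stop" on an already-stopped daemon exit 0
case "$1" in
  stop)
    if /etc/init.d/dhcpd status >/dev/null 2>&1; then
      exec /etc/init.d/dhcpd stop
    fi
    exit 0   # already stopped: report success, as heartbeat expects
    ;;
  *)
    exec /etc/init.d/dhcpd "$@"
    ;;
esac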

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problem with static routes

2010-08-23 Thread Greg Woods
On Sun, 2010-08-22 at 10:25 -0600, Greg Woods wrote:
 The basic problem is that
 when I reboot a node in my cluster, it comes back up without its static
 routes. 

I have determined through experimentation that it is the setup/teardown
of Xen networking that is causing this. The static routes also go away
if I just put a node on standby (which shuts down Xen networking), or
even if I put a standby node back online. So I will take this to the
xen-users list. It doesn't look like it has anything to do with the HA
code itself.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] problem with static routes

2010-08-22 Thread Greg Woods
OS: CentOS 5.5
heartbeat: heartbeat-3.0.3-2.3.el5 (latest from clusterlabs)
pacemaker: pacemaker-1.0.9.1-1.15.el5 (latest from clusterlabs)

If it matters, this cluster is primarily used to run Xen virtual
machines (xen-3.0.3-105.el5_5.5 kernel-2.6.18-194.11.1.el5xen latest
from CentOS)

I have been looking off and on for the source of this problem for quite
a while without finding what is causing it. The basic problem is that
when I reboot a node in my cluster, it comes back up without its static
routes. Adding them back in manually works; they stay until the next
reboot. These are defined in /etc/sysconfig/static-routes and are added
by the network service at boot time. I have been able to pretty much
rule out the boot process itself as the source of the problem. I added a
netstat -r -n > /tmp/static-routes command to the rc.local file which
is the very last thing run at boot time and the routes are there. I have
also tried putting nodes into standby (crm node standby) and back
online, and the routes stay there through that. But once I log in after
a reboot, the static routes are gone and I have to manually re-add them.

I can probably work around this using a hideous kludge like having the
rc.local file run a background job that sleeps for a couple of minutes,
then adds the routes, but that doesn't really fix the issue and isn't
guaranteed to work reliably (obviously high reliability is important or
I wouldn't be using HA in the first place).
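
Just to make the kludge concrete, it would amount to something like this at the end of /etc/rc.d/rc.local (the route below is purely a placeholder):

# wait out the Xen network setup, then re-add the static routes
( sleep 120
  route add -net 192.168.100.0 netmask 255.255.255.0 gw 10.0.0.1 ) &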

Has anyone ever seen this before or have any clue where I can look to
troubleshoot this?

Thanks in advance,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Question about grouping with a clone inside group ?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 17:09 +0200, Alain.Moulle wrote:

 crm configure colocation coloc1 +INFINITY:group1 clone-fs1

This says that group1 and clone-fs1 have to be on the same machine. That
prohibits starting clone-fs1 on a machine where group1 is not running.
That isn't what you meant. I think all you need is the order directive
to make sure clone-fs1 is started before group1.
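
Something along these lines should be all that is needed (the constraint id is arbitrary):

order fs1-before-group1 inf: clone-fs1 group1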

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] time to fork heartbeat?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 17:52 -0400, Peter Sylvester wrote:
 I do have to agree.  I've actually been working for almost 4 business days
 now on trying to get Heartbeat and Pacemaker working together 

It took me six months to build a decent cluster, starting as one who was
very experienced with heartbeat v1 master-hotspare pairs. But to be
fair, heartbeat and pacemaker were not the only things I was learning, I
also built the DRBD volumes on top of LVM volumes, then put LVM volumes
on top of the DRBD volumes. That was very complicated to get working,
but provides huge flexibility in that I can increase the size of DRBD
volumes or individual file systems mounted on the DRBD volumes without
major reconfiguration. Then it's on to Xen, heartbeat, and pacemaker.
Eventually I spent quite a bit of time writing myself a management
program that makes it easy to do things like add a new virtual machine
(takes care of running the CRM shell to add the necessary config lines
and that sort of thing). But it was incredibly difficult to get this
working. Failing to configure a single resource properly can start a
stonith death match and bring down the entire cluster. I do see the
advantages of the extra flexibility, and I have begun using some of it.
But there are a lot of use cases where a simple heartbeat v1
configuration is just fine and far easier to understand and implement.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Am I even on the right track here with Heartbeat?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 17:13 -0500, Dimitri Maziuk wrote:

  So is it not practical to run RHEL or CentOS 5.x where you'd get this
  version and several more years of disto maintenance?
 
 It's not practical if you want to have both distro maintenance or cluster 
 support. 

I run CentOS 5.5, and there are maintained RPMs in the clusterlabs repo:

http://www.clusterlabs.org/rpm/epel-5

I am running heartbeat 3.0.2 and I notice they have a 3.0.3 now.
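
For the record, the repo definition behind that amounts to something like this (the name line is arbitrary and the gpgcheck setting is a guess; check against what clusterlabs actually ships):

# /etc/yum.repos.d/clusterlabs.repo
[clusterlabs]
name=ClusterLabs packages for EL5
baseurl=http://www.clusterlabs.org/rpm/epel-5
enabled=1
gpgcheck=0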

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] time to fork heartbeat?

2010-08-11 Thread Greg Woods
On Thu, 2010-08-12 at 00:38 +0200, Dejan Muhamedagic wrote:

 On Wed, Aug 11, 2010 at 09:53:01PM -, Yan Seiner wrote:

  Heck, it really should just take two things:
  
  1.  IP of remote computer
  2.  Device to use
 
 Device?
 
  Bang, it just works.
  
  For many of us this would be sufficient.
 
 Hmm, I don't think HA can be that easy.

Probably not, but that doesn't mean it has to be as hard as getting
pacemaker to work currently is either.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Am I even on the right track here with Heartbeat?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 20:01 -0500, Dimitri Maziuk wrote:

 1) there are installations where throwing in a package from 3rd party repo 
 will cost you a lot. Like tech. support on a very very expensive piece of 
hardware. (Think giant hadron collider type of hardware.)

Sure, there are some situations where this is not a good option. I only
know it works for me.

 
 The other issue with packages from 3rd party repos is, of course
 2) so how many times did you have to unfsck yum update conflicts so 
 far?

In this particular case: never. There are only a few RPMs in the
clusterlabs repo and they all relate to heartbeat/corosync/pacemaker.

 
 That aside, the real problem for me is I haven't seen V2-style docs that 
 actually made sense yet. 

I found the clusterlabs documents useful, but I too had to learn much
through the school of hard knocks. This is fairly typical of open source
projects; geeks want to code, not write documentation, so often the
documentation does not keep up with the code.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Waiting for confirmation before failover on the backup server

2010-07-23 Thread Greg Woods
On Fri, 2010-07-23 at 06:24 -0700, Mahadevan Iyer wrote:
 When using only heartbeat(no pacemaker) is there a way to do the following
 
   Setup a backup server such that when it tries to take over due to loss of 
 connectivity with the main server, it waits for confirmation from an operator


This is exactly what the meatware STONITH plugin is for. 

http://www.clusterlabs.org/doc/crm_fencing.html

..near the bottom of the page is the description of the meatware plugin.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Manual intervention on failover

2010-07-15 Thread Greg Woods
On Thu, 2010-07-15 at 07:36 -0700, Pushkar Pradhan wrote:
 Hi,
 I have a strange requirement: I don't want failover to happen unless an 
 operator says go ahead or a big timeout has occurred (e.g. 1 hour). I am 
 using Heartbeat R1 style cluster with 2 nodes.
 Is this possible or do I need to write some custom plugin?

This may not be the most elegant solution, but you could do it with the
meatware stonith device, which does exactly this: someone has to
manually confirm that the other machine is really and truly dead before
a failover will happen.

This would be easy to set up if you are already using stonith, and a
non-trivial learning curve otherwise.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-28 Thread Greg Woods
On Mon, 2010-06-28 at 10:47 +0200, Dejan Muhamedagic wrote:

  (drbd_xen2:1:probe:stderr) DRBD module version: 8.3.8userland
  version: 8.3.6 you should upgrade your drbd tools!
 
 I guess that you should follow this advice.

Just one data point: I get this message in my logs too, but DRBD works
fine anyway (using the native version from CentOS 5.5).

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-27 Thread Greg Woods
On Sun, 2010-06-27 at 03:02 -0700, Joe Shang wrote:

 Failed actions:
 drbd_xen2:1_start_0 (node=xen1.box.com, call=10, rc=5,
 status=complete): not installed

This is one of the things that I don't like about heartbeat/pacemaker. A
minor error (misconfiguring a single resource) can cause major problems
(like a stonith death match that brings down the entire cluster).

One thing I have seen with Xen VMs is that the default timeouts are too
short. That may not be your particular problem, but you probably need to
increase them anyway. This is an example of what I have:



primitive VM-ldap ocf:heartbeat:Xen \
params xmfile=/etc/xen/ldap \
op monitor interval=10 timeout=120 depth=0
target-role=Stopped \
op start interval=0 timeout=60s \
op stop interval=0 timeout=120s \
meta is-managed=true target-role=Started

Before I added the explicit op start and op stop timeouts, I would
get failed stop or start operations, and any attempt to fail over would
start a death match.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-27 Thread Greg Woods
On Sun, 2010-06-27 at 07:57 -0700, Joe Shang wrote:

 Jun 27 10:51:49 xen1 lrmd: [3949]: info: RA output:
 (drbd_xen2:1:probe:stderr) 'xen2' not defined in your config.

This looks like an error in your DRBD configuration. What is in
drbd.conf? What does drbd-overview or drbdadm state all show?

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-27 Thread Greg Woods
You could try making one of them primary:

# drbdadm primary xen1

If that doesn't work, you may have encountered a split brain situation.
In that case, you have to tell DRBD that it is OK for one of the
machines to discard the data it has so that the other one can become
primary. Look here:

http://www.drbd.org/users-guide/s-resolve-split-brain.html
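
The recovery described there boils down to roughly the following (the resource name r0 is a placeholder), run on the node whose data you are willing to throw away:

drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

and then, if the surviving node is sitting in StandAlone, a plain
drbdadm connect r0 on that side.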

One thing is for certain: you must resolve the low level DRBD problem
before there is any chance of bringing your cluster software back up.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith external/rackpdu question

2010-05-20 Thread Greg Woods
On Thu, 2010-05-20 at 18:30 +0100, Alexander Fisher wrote:

 I think I'll use IPMI and rackpdu in the same configuration.

That is exactly what I will eventually try (assuming I ever get any time
to work on my test cluster some more).

It is clear that, no matter what I do, I cannot prepare for every
possible thing that could happen. There will always be a scenario in
which the stonith may fail to work. But the more unlikely that scenario
is, the safer we are.

In our case, the IPMI devices are connected via a crossover cable, so no
switch failure can knock this out. There is still the possibility of a
failure of the IPMI device (including via a complete power loss to one
of the cluster nodes) or a cable failure. To insulate against that, I
will use the PDU as a second stonith device. It will only be used if the
IPMI stonith fails to work. As Alex pointed out, a switch failure (or
switch port failure) could cause the PDU stonith to also fail, but the
chances of that *and* a failure of IPMI happening at the same time is
quite remote. At least this way my entire set of cluster resources does
not depend on a single ethernet cable, and there is something in place
that will allow the remaining node to take over if one node suffers a
complete power loss (the original scenario I was worried about that
started me down the multiple stonith device path in the first place).

--Greg




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith external/rackpdu question

2010-05-18 Thread Greg Woods

 Do you know that on APC PDUs, you can group outlets across several
 physical PDUs?  I've got a bit more testing to do, but this seems to work ok.
 The plugin is configured to talk to just one outlet on one of the PDUs and the
 PDU does the rest.

No, I didn't know you could do this. I will have to investigate it.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith external/rackpdu question

2010-05-05 Thread Greg Woods
On Wed, 2010-05-05 at 13:29 +0200, Dejan Muhamedagic wrote:

 If these servers have a lights-out device and the power
 distribution is fairly reliable, that could be an alternative for
 fencing.

They do have an IPMI device and it does work. I am trying to insulate
against a failure of the NIC or cable by having a second stonith device.

The cluster I have now is primarily for testing, but eventually we will
be implementing critical services (e.g. DNS, e-mail, DHCP, and
authentication) in virtual machines running on a cluster like this one,
so part of the testing process is to learn what can and can't be done
and where the potential gotchas are. I have discovered that if I
simulate a cable failure by removing it, bad things happen because
stonith cannot succeed. I would not want my DNS system to be vulnerable
to a single cable failing, so I am looking for ways to guard against it.

A complete power outage on one of the nodes also results in bad things
when using IPMI. Again stonith cannot succeed and so the remaining
server will not take over the resources. Yes, these are dual power
supply servers, so short of human error (or possibly a motherboard
failure?) it is unlikely that only one of them would completely lose
power, but I am still looking for a way to guard against this. Right now I have a
meatware stonith device set up so that I can at least log in remotely
and manually force the remaining server to take over, but I am looking
for something more automatic. It would be nice to avoid those 3AM phone
calls )-:

I may take a shot at modifying the external/rackpdu stonith plugin at
some point. We can't be the only ones in the world using dual power
supply servers. I'll probably start by unplugging one of the power
supplies on each server and making sure I understand how to use the
plugin in single-outlet mode, then try doing the modifications to
support dual outlets.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] stonith external/rackpdu question

2010-05-04 Thread Greg Woods
We have a pair of servers in a cluster plugged into a pair of APC
rack-mounted PDU's of the sort that could be controlled by this stonith
plugin. My problem is that these are dual power supply servers, which
means I would have to shut down two outlets that are on on two different
PDUs to completely power off one of the nodes. Is it possible to use
this stonith plugin to do that? The documentation on configuring outlet
numbers (from crm ra info stonith:external/rackpdu) is a bit sparse;
it isn't clear that what I want to do is even possible.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat doesn't see other nodes in cluster

2010-04-15 Thread Greg Woods
On Wed, 2010-04-14 at 16:24 -0700, Stephen Punak wrote:

 Heartbeat appears to start just fine on all nodes, but none of them see each 
 other. 

Any chance there is a firewall blocking the heartbeat packets? You'd
still see them with wireshark, but they would be blocked from getting to
the listening application.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] manual fencing

2010-04-11 Thread Greg Woods
On Thu, 2010-04-08 at 17:46 +0200, Dejan Muhamedagic wrote:

 
 Does this help?
 
 $ crm ra info stonith:meatware

Yes, it does! Thank you!

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] trouble with CRM/XEN

2010-04-07 Thread Greg Woods
On Wed, 2010-04-07 at 15:39 +0200, Andrew Beekhof wrote:
  I increased the timeout
  even further (to 120s instead of the minimum recommended 60) and it
  seems to be working. Curious though, because when it does work, the logs
  show that the entire stop operation, including a live migration, takes
  only about 7 seconds.
 
 It depends on what else the machine is doing.
 Are there any other Xen instances that might be migrating too?

The test cluster currently has two Xen VMs; one is tied to a particular
DRBD volume, so it has colocation and order constraints so that it must
shut down, wait for the DRBD/Filesystem/LVM stack to fail over, and
restart. Still, even that doesn't take more than 60 seconds. The other
VM is stored on an NFS volume so that it can live migrate
(allow-migrate=true). I have seen failures of the stop operation on
both of them prior to increasing the timeout.

Surely it's not handling the resources sequentially? That will be a
disaster if we get to where I want to be going, which may involve dozens
or even hundreds of VMs on a cluster. I realize I may have to adjust the
timeout up higher for the simple reason that a few dozen VM's shutting
down in parallel is going to take longer than one or two in parallel due
to sharing of host OS resources, but hopefully the timeout won't be a
linear function of the number of VMs.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] manual fencing

2010-04-07 Thread Greg Woods
On Wed, 2010-04-07 at 12:13 +0200, Dejan Muhamedagic wrote:

 There's not much magic, you just configure two stonith resources
 and assign different priority, then they'll be tried in a
 round-robin fashion. For instance:
 
 primitive st-real stonith:ipmilan \
   params ... priority=10  # it will try this one first
 primitive st-real stonith:meatware \
   params ... priority=20

Yes, but if you don't KNOW that, then it's magic :-)

Finding out what the parameters are for a given resource definition and
what they do is the magic part; I often cannot find good documentation
on this. The help feature in the crm shell is useful; it often tells
me what the parameters *are*, but it doesn't tell me what they *do*, or
what a reasonable value might be. Fortunately we have the mailing list;
thanks again.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] order directives and pacemaker update

2010-04-06 Thread Greg Woods
On Tue, 2010-04-06 at 12:29 +0200, Dejan Muhamedagic wrote:

 
 There's a crm shell bug in 1.0.8-2 in the validation process.
 Either revert to the earlier pacemaker or apply this patch:
 
 http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/422fed9d8776

OK, that's a relief. I chose to apply the patch because there are some
features of 1.0.8-2 that I like (such as warning me when I have
forgotten to explicitly set the start/stop timeouts). But after applying
the patch, and figuring out how to compile Python into bytecode, it now
works! Thank you very much. Presumably the next released version will
have this patch in it so this is a one-time thing.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] manual fencing

2010-04-06 Thread Greg Woods
I'm looking for a good way to deal with the total power drop case. I
am using an iDrac 6 as a stonith device on a pair of Dell R710 servers.
I tried the power drop test today by simply unplugging the power on one
of the nodes. What happens in this case is that the attempt by the other
node to stonith the dead node fails, so the other node refuses to take
over resources.

Since this is a fairly rare scenario (the machines have dual power
supplies and use the same pair of power circuits, so the chances that
one node completely loses power and the other doesn't are almost
nonexistent; the most likely way this could happen is a human accidentally powering
off the wrong machine), I'd be willing to deal with it in manual mode as
long as it can be done remotely. Is there any way to manually fence a
node that I know is dead? I.e. to tell the still-running node I know
the other node is dead even though you can't stonith it, please pretend
the stonith succeeded and take over resources? 

Thanks,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] manual fencing

2010-04-06 Thread Greg Woods
On Tue, 2010-04-06 at 14:58 -0700, Tony Gan wrote:

 I think the solution is using UPS or PDU for STONITH device.

That could improve things in some scenarios, but it does not completely
solve the problem. The cluster is still vulnerable to having the entire
power strip for one node unplugged or turned off. No matter what the
stonith device is, there is always the possibility that the stonith
device itself fails. My goal is to be able to recover from something
like this remotely, before I can actually get there to correct the real
problem.

In fact, the chance that stonith will fail because one of the nodes has
completely lost power due to hardware failure while the other one still
has power is extremely small. They both have dual power supplies and
they both use the same two circuits, so the only at all likely way I
could get into the state I am concerned about is human error.
Unfortunately, we do have a lot of people with machine room access,
which makes someone powering off the wrong machine by mistake a real
possibility. The chance that two power supplies would fail at the same
time is remote. Unfortunately, human error is also possible with a
controllable power strip as the stonith device, so that doesn't really
solve my problem.

I do think I found something that might work. I'm not sure yet, but it
looks like I can create a stonith:meatware resource in addition to the
stonith:ipmilan resource. That would allow me to manually confirm that
the powerless node is in fact dead and have the remaining node take
over. That confirmation can be done by logging in to the live node
remotely, so it will serve my needs if I can figure out the magic
incantation to configure this correctly.
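
If I'm reading the meatware documentation right, the manual confirmation
would then be something like this, run from a shell on the surviving
node (the node name is just an example):

meatclient -c node1.fqdn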

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] trouble with stonith [SOLVED]

2010-04-05 Thread Greg Woods
On Sat, 2010-04-03 at 22:55 +0200, Dejan Muhamedagic wrote:

 That should've probably caused a connection timeout and this
 message:
 
 IPMI operation timed out... :(
 
 Was there such a message in the log?

Now that I know to look for it, yes. So far I am having a great deal of
difficulty sifting through the logs to find the messages that are
relevant to whatever problem I am trying to solve, then interpreting
what they actually mean when I do find them. I have a big learning curve
still to climb.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] trouble with CRM/XEN

2010-04-05 Thread Greg Woods
On Sat, 2010-04-03 at 22:45 +0200, Dejan Muhamedagic wrote:

  I spoke too soon; now I am getting failures when stopping the Xen
  resources manually as well. I can't get both nodes online at the same
  time unless I disable stonith.
 
 There should be something in the logs. grep for lrmd and the
 lines containing the resource name.


What I see is that the stop operation timed out. I increased the
timeout even further (to 120s instead of the recommended minimum of 60s)
and it seems to be working. Curious, though, because when it does work,
the logs show that the entire stop operation, including a live
migration, takes only about 7 seconds.
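
For anyone following along, the grep amounts to something like this
(assuming syslog writes to /var/log/messages, as on my CentOS 5 boxes;
the resource name is just one of mine as an example):

grep lrmd /var/log/messages | grep VM-cfvmserve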

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] order directives and pacemaker update

2010-04-05 Thread Greg Woods
Since I applied the most recent Pacemaker update last Friday (now
running  pacemaker-1.0.8-2.el5.x86_64 on CentOS 5), I can no longer
enter order directives. I am using the exact same syntax that I used
previously, and the syntax matches some existing directives, but the crm
shell won't take it. Here is an example:

crm(live)configure# primitive VM-cfvmserve ocf:heartbeat:Xen params
xmfile=/etc/xen/cfvmserve op monitor interval=10 timeout=120
depth=0 op start interval=0 timeout=60s op stop interval=0
timeout=120s meta allow-migrate=true target-role=Stopped
crm(live)configure# order vmnfs-before-cfvmserve inf: vmnfs-cl
VM-cfvmserve
crm(live)configure# verify
ERROR: cib-bootstrap-options: attribute dc-version does not exist
ERROR: cib-bootstrap-options: attribute cluster-infrastructure does not
exist
ERROR: cib-bootstrap-options: attribute last-lrm-refresh does not exist

I don't know why I am getting these errors and I am not sure they are
relevant to the problem I am seeing. But here's what happens later:

crm(live)configure# commit
element rsc_order: validity error : IDREF attribute first references an
unknown ID vmnfs-cl

vmnfs-cl is a clone resource of a file system mount. That resource is,
and always has been, present. I also have a number of order directives
that reference it that are already in the CIB and are working. Here's a
snippet from crm configure show:

primitive vmnfs ocf:heartbeat:Filesystem \
        params directory=/vmnfs device=phantom.ucar.edu:/vol/dsgtest fstype=nfs \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s

clone vmnfs-cl vmnfs \
        meta target-role=Started

order vmnfs-before-linstall inf: vmnfs-cl VM-linstall

(this last is an example of one that is already present that works).

Lastly, another part of the configure show output addressing the ERROR
messages above:

property $id=cib-bootstrap-options \
        dc-version=1.0.8-3225fc0d98c8fcd0f7b24f0134e89967136a9b00 \
        cluster-infrastructure=Heartbeat \
        stonith-enabled=true \
        last-lrm-refresh=1270484103 \
        default-resource-stickiness=

This is currently preventing me from being able to add any more virtual
machines. The ones that are already in are working (including proper
failovers and migration). So is this a bug in the new code, or something
I was doing wrong all along that is only now being flagged and that I
just luckily got away with before?

I can send the full configuration if that is deemed necessary but I
would have to sanitize it to remove idrac passwords, local IP addresses,
and so forth, so I won't do that unless it's the only way to figure this
out.

--Greg




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] trouble with stonith [SOLVED]

2010-04-02 Thread Greg Woods
On Thu, 2010-04-01 at 15:38 -0600, Greg Woods wrote:

 node1# stonith -t ipmilan -F node2-param-file node2
 
 This works both ways; the remote node reboots. So I should be able to
 rule out DRAC configuration issues. I have also checked, double-checked,
 and triple checked that the parameters in the stonith resources are
 specified correctly 


...but I still missed one. It sure would be nice if the log would tell
me something other than "it failed" when there is a mistake in the
parameters. I literally looked at the configuration four times before I
noticed that the port number was wrong.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] trouble with CRM/XEN

2010-04-02 Thread Greg Woods
I am having difficulty achieving a clean failover in a Pacemaker 1.0.7
cluster that is mainly there to run Xen virtual machines. I realize that
nobody can tell me exactly what is wrong without seeing an awful lot of
configuration detail; what I am looking for is more like some general
methods I can use to debug this.

In a nutshell: if I manually stop all the Xen resources first with a
command like "crm resource stop vmname", then failover works perfectly,
and restarting them all manually after a failover also works and
everything appears to be running fine. However, if I just stop heartbeat
on node1, then restart it, then the attempts to stop Xen resources on
node2 (preparatory to moving them back to node1) all fail, resulting in
a stonith of node2 from node1. node1 will start up all the resources,
but when node2 reboots, the process repeats: attempts to stop the Xen
resources on node1 fail, resulting in a stonith of node1 from node2.
Kind of a delayed death match. The only way to break the cycle is to
manually stop the Xen resources before bringing a recovered node back
online. Stop works fine when invoked manually, but fails when invoked
automatically as a result of an attempt to move resources back to a
recovered node.

I have already tried setting allow-migrate=false on all the Xen resource
definitions just to eliminate one more complication until I can figure
this out.

Any ideas on how I can debug this? The HA logs don't seem to be
terribly helpful; they only indicate that the stop operation failed but
say nothing about why it failed.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] where is CRM 'default timeout' set?

2010-04-02 Thread Greg Woods
In the process of trying to fix two other problems, I messed something
up badly. Now when I go into the crm shell to edit the configuration, on
verify I get a message like this for every one of my configured
resources:

WARNING: vm-ip1: default timeout 20s for start is smaller than the
advised 90

The 20s is the same for every resource, but the advised value varies by
resource type. I have never seen a message like this before, so I have
no idea why it suddenly started (although I did update the pacemaker
package today, which could be why I haven't seen it before). Where is
this 20s "default timeout" being set? What does this message *really*
mean?

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] trouble with CRM/XEN

2010-04-02 Thread Greg Woods
On Fri, 2010-04-02 at 13:02 -0600, Greg Woods wrote:

 In a nutshell: if I manually stop all the Xen resources first with a
 command like "crm resource stop vmname", then failover works perfectly,

I spoke too soon; now I am getting failures when stopping the Xen
resources manually as well. I can't get both nodes online at the same
time unless I disable stonith.

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] where is CRM 'default timeout' set? [SOLVED]

2010-04-02 Thread Greg Woods
On Fri, 2010-04-02 at 13:53 -0600, Greg Woods wrote:

 WARNING: vm-ip1: default timeout 20s for start is smaller than the
 advised 90

Found the answer to this one in the gossamer-threads archive of the
pacemaker list. Should have thought of looking there first.

For those who are struggling with the documentation as much as I am: it
is recommended that warnings like this be eliminated by setting the
start and stop timeouts on each individual resource. This is done in the
crm shell, inside the primitive command that defines the resource.
Inserting a line like this would get rid of the above warning:

op start timeout=90s \


Easy once you know the magic incantation.
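
In context, a full primitive then looks something like this (assuming
vm-ip1 is an IPaddr2 resource; the IP address and the other timeout
values are made up for illustration):

primitive vm-ip1 ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24 \
        op start interval=0 timeout=90s \
        op stop interval=0 timeout=100s \
        op monitor interval=10s timeout=20s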

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] trouble with stonith

2010-04-01 Thread Greg Woods
I'm trying to get stonith to work on a two-node cluster using Dell
iDrac. If I run stonith manually with a command like:

node1# stonith -t ipmilan -F node2-param-file node2

This works both ways; the remote node reboots. So I should be able to
rule out DRAC configuration issues. I have also checked, double-checked,
and triple checked that the parameters in the stonith resources are
specified correctly and match those from the param-files (almost, see
below).

However, when the cluster starts, the stonith resources fail to start.
If I run a cleanup command to clear out the old status, here is what
happens:


Apr 01 15:20:13 vmserve.scd.ucar.edu lrmd: [13093]: debug:
stonithd_receive_ops_result: begin
Apr 01 15:20:13 vmserve.scd.ucar.edu stonithd: [5607]: debug: Child
process unknown_stonith-vm1_monitor [13094] exited, its exit code: 7
when signo=0.
Apr 01 15:20:13 vmserve.scd.ucar.edu stonithd: [5607]: debug:
stonith-vm1's (ipmilan) op monitor finished. op_result=7
Apr 01 15:20:13 vmserve.scd.ucar.edu stonithd: [5607]: debug: client
STONITH_RA_EXEC_13093 (pid=13093) signed off
Apr 01 15:20:13 vmserve.scd.ucar.edu lrmd: [5606]: WARN: Managed
stonith-vm1:monitor process 13093 exited with return code 7.
 confirmed=true) not running

One possible issue is that the param-files specify
reset_method=power_cycle, but if I try to set this with crm edit, it
says that reset_method is an unknown parameter:

ERROR: stonith-vm1: parameter reset_method does not exist

This appears inside the crm shell immediately upon exiting the editor.

Any ideas on how I can repair this so that the stonith resources will
start properly? Any other information I should provide?

Thank you,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] NFS and DRBD

2010-03-23 Thread Greg Woods

  On one node, I can get all services to start (and they work fine),
  but whenever failover occurs, there are NFS-related handles left open,
  thus inhibiting/hanging the failover. More specifically, the file
  systems fail to unmount.

If you are referring to file systems on the server that are made
available for NFS mounting and that hang on unmount (it's not clear from
the above whether your cluster nodes are NFS servers or clients), then
you need to unexport the file systems first; then you can umount them. I
handled this by writing my own nfs-exports RA that basically just does
an exportfs -u with the appropriate parameters, and I used an order line
in the crm shell to make sure that the Filesystem resource is ordered
before the nfs-exports resource. The nfs-exports resource exports the
file system on start and unexports it on stop.
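
Stripped down to the idea, it is something like this (the real RA also
has OCF metadata, proper exit codes and error handling; the export path
and client spec below are placeholders):

#!/bin/sh
# minimal sketch of an nfs-exports resource agent
EXPORT=/vmnfs            # placeholder export path
CLIENTS=10.0.0.0/24      # placeholder client spec
case "$1" in
  start)   exportfs -o rw,no_root_squash ${CLIENTS}:${EXPORT} ;;
  stop)    exportfs -u ${CLIENTS}:${EXPORT} ;;
  monitor) exportfs | grep -q "^${EXPORT}" ;;
esac

plus an ordering constraint in the crm shell (resource names are
placeholders too):

order fs-before-exports inf: my-filesystem my-nfs-exports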

--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] getting stonith working [SOLVED]

2010-03-09 Thread Greg Woods
On Mon, 2010-03-08 at 16:56 +0100, Dejan Muhamedagic wrote:
 Hi,
 
 On Fri, Mar 05, 2010 at 03:07:45PM -0700, Greg Woods wrote:
  Partially solved, anyway.
 
 Glad you got it solved, but why do you say partially?

Because I managed to get it working without ever figuring out exactly
what it was I had done wrong. 

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] getting stonith working

2010-03-05 Thread Greg Woods
I am in the process of climbing the learning curve for Pacemaker. I'm
using RPMs from clusterlabs on CentOS 5:

heartbeat-3.0.2-2.el5
pacemaker-1.0.7-4.el5

It has been a long hard struggle, but I have mostly gotten my two-node
cluster working. But I've hit a wall trying to get stonith to work. I
used these commands (sanitized) in the CRM shell:

crm(live)configure# primitive stonith-vm1 stonith:ipmilan params
auth=straight hostname=node1.fqdn ipaddr=** login=*
password=* priv=admin port=23
crm(live)configure# location nosuicide-vm1 stonith-vm1 rule -inf: #uname
eq node1.fqdn

Committing seems to work, but it fails to start the stonith resource.
The error I get in the logs is:

Mar 05 12:30:55 node2.fqdn stonithd: [6982]: WARN: start stonith-vm1
failed, because its hostlist is empty

I have Googled up previous e-mail messages about this error message but
no solution was posted. Where is the hostlist set? If I try to use that
as a parameter, I get an error that there is no such parameter. 

Just for grins I tried the equivalent thing (some of the parameter
names are slightly different) using an external/ipmi stonith device and
got the same error. I must be missing something very fundamental.



Thanks for any pointers,
--Greg



___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] getting stonith working [SOLVED]

2010-03-05 Thread Greg Woods
Partially solved, anyway.

On Fri, 2010-03-05 at 12:52 -0700, Greg Woods wrote:

 crm(live)configure# primitive stonith-vm1 stonith:ipmilan params
 auth=straight hostname=node1.fqdn ipaddr=** login=*
 password=* priv=admin port=23
 crm(live)configure# location nosuicide-vm1 stonith-vm1 rule -inf: #uname
 eq node1.fqdn
 
 Committing seems to work, but it fails to start the stonith resource.
 The error I get in the logs is:
 
 Mar 05 12:30:55 node2.fqdn stonithd: [6982]: WARN: start stonith-vm1
 failed, because its hostlist is empty

It appears that this is a generic error that can happen if there is any
kind of error in the values of the parameters that can't be detected at
resource creation time. In this example, it turns out that auth=straight
isn't supported. After an hour or so of playing around with the
stonith command, I finally got pointed to the README.ipmilan file so
that I could create a config file that worked for invoking the stonith
command manually. That is where I discovered that auth=straight does not
work on my systems, but auth=md2 does (it doesn't really matter what
auth type I use since the IPMI devices are connected by a crossover
cable and are not on a public net). Changing the value of the auth
parameter from straight to md2 got rid of the empty hostlist error.
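
For the archives, the working definition is the same as the original one
above, with only the auth parameter changed (other values sanitized as
before):

primitive stonith-vm1 stonith:ipmilan \
        params auth=md2 hostname=node1.fqdn ipaddr=** login=* \
               password=* priv=admin port=23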

--Greg





___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Backup of SVN repositories

2008-12-03 Thread Greg Woods
On Wed, 2008-12-03 at 21:23 +, Todd, Conor (PDX) wrote:

  I can't do this using a crontab because one never knows which host 
 will be running the SVN service (and have the disks mounted for it). 


  Has anyone else tackled this issue yet?

You may not know in advance whether a given host is the master, but you
can check at run time. I do this and it works fine; I have a variety of
cron jobs on several different heartbeat/DRBD clusters that I want to
run only on the master, so just check for the presence of something that
will only be there when the shared storage area is mounted:

* * * * * [ -d /rep/mysql ] && cron-script

This is when the DRBD shared disk is mounted as /rep, and /var/lib/mysql
is a symlink to /rep/mysql for a MySQL service. The condition is true
only when that host is the master, so cron-script only runs on the
master. Kludgy but simple; works for me.
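
For an SVN repository specifically, the same trick would look something
like this (paths and repository name are made up; adjust to your
layout):

0 2 * * * [ -d /rep/svn ] && svnadmin dump -q /rep/svn/myrepo | gzip > /backup/myrepo.svndump.gz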

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA of slpad (openLDAP) with Heartbeat and IPAddr.

2008-11-24 Thread Greg Woods
On Mon, 2008-11-24 at 18:47 +0530, Divyarajsinh Jadeja wrote:
 Hi,
 
 
 
 I am new to Heartbeat. How can we configure openLDAP with Heartbeat for
 High-availability of Authentication.?
 
 
 
 I need to have slapd running on both the machine because, ldap replication
 needs slapd on both node. 

I tried it that way. I never could find a reliable way to set things up
without creating replication loops. It is far easier to replicate the
LDAP data with shared storage via DRBD than with LDAP replication. Then
you do not need to run slapd on a node until it becomes the master, and
slapd becomes a standard heartbeat-manageable resource. I do this and it
works great.

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Running OpenLDAP and MySQL w/Linux HA

2008-11-20 Thread Greg Woods
On Wed, 2008-11-19 at 13:50 -0800, Rob Tanner wrote:

 The next thing to try is OpenLDAP and MySQL, both of which are critical
 services and both of which are far more complex. Is anyone running them
 on Linux-HA? Does it work reliably when you switch, etc.? How do you
 have it all configured?

I run both of these under heartbeat v1. It works quite well. What you
need is some shared storage, so that you don't have to mess with LDAP or
MySQL replication. I found when I tried to set up one LDAP server as
master and one as slave, and have the slave take over as master, it was
very easy to create infinite replication loops. MySQL replication that
is truly bidirectional is very difficult to get right. I found it was
much easier to just create shared storage for the LDAP and MySQL
database files using DRBD. 

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] New user questions, config file locations and hb_gui

2008-10-23 Thread Greg Woods
On Thu, 2008-10-23 at 10:48 -0600, Landon Cox wrote:

 b) apache, postgresql, mysql and some custom services are always  
 running on both machines to reduce startup times on failover

You might want to carefully consider the tradeoff here. Getting two-way
database replication to work reliably can be a huge headache. I have no
experience with postgresql, but I have never been able to make it work
with mysql. I found it was just easier to use DRBD to replicate the
database at the disk-partition level and put up with the startup time on
failover. Even with a good-size database (one that stores several days'
worth of e-mail for our 1200-employee organization), mysql startup takes
at most a few seconds. That is a small price to pay to avoid the
headaches of database-level replication. Do you really have an
application where you cannot afford even a few seconds of downtime at
failover?

It is also unclear to me whether you can bind an application to an
alias interface like eth0:0 that doesn't even exist when the application
is started (it is created by heartbeat at failover time). Thus it might
not even be possible to have your apps running before failover and have
them listening on the service address after failover. Has anyone
actually tried this?

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] New user questions, config file locations and hb_gui

2008-10-23 Thread Greg Woods
On Thu, 2008-10-23 at 13:24 -0600, Landon Cox wrote:

  Do you really have
  an application where you can't even afford a few seconds down time at
  failover?
 
 No.   Anything sub-60 seconds would be tolerated.

In that case, I really think it will be easier to set up DRBD. That way
you can automatically replicate anything (web content, apache config
files, databases, etc.) simply by creating the appropriate symlinks into
the shared partition. Just be certain you never put anything in the
shared partition that is needed at boot time or when the machine is not
in primary mode (an obvious and particularly stupid example would
be /etc/passwd).

 Controlling the order so IPAddr2 fires and finishes synchronously  
 before starting apache or postgres, for example, is feasible,  
 correct? 

Yes. I personally have never used the xml-style configuration or the
hb_gui, so I can't tell you exactly how you would do this. But in a
v1-style haresources file, you specify the order in which resources are
started, and you always have the IPaddr2 resources first, followed by
drbddisk and Filesystem, and finally your service daemons last.
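
To give the flavor of it, a v1 haresources line might look something
like this (node name, address and service names are placeholders):

node1.example.com IPaddr2::192.0.2.10/24/eth0 drbddisk::r0 \
        Filesystem::/dev/drbd0::/rep::ext3 httpd postgresql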

--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] USB serial links?

2008-10-17 Thread Greg Woods
I am aware that heartbeat can be done over USB links using USB-Ethernet
interfaces. I specifically do not want to do that, because I am looking
for a heartbeat link that will be independent of the IP stack, but on
machines that do not have native serial ports. So I got a couple of
Keyspan USB-Serial adapters which I have had success using on my laptop
for various purposes. However, if I connect the same null-modem cable
between two of these adapters that works fine between two on-board
serial ports, it fails the "cat test": I don't see anything on the other
side. Both machines properly detect the adapter and create /dev/ttyUSB0,
but the link does not work.
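
(By the "cat test" I mean roughly the following; both ends need matching
serial settings first, and 19200 is just an example speed.)

# on both nodes, match the serial settings first
stty -F /dev/ttyUSB0 19200 raw
# on node A:
cat /dev/ttyUSB0
# on node B:
echo hello > /dev/ttyUSB0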

Is a different sort of serial cable needed for this? Is there a problem
with this particular type of USB Serial adapter and something else would
work better? Has anyone successfully gotten a USB-Serial heartbeat to
work?

This is CentOS 5 on x86_64 with heartbeat 2.1.3-3.el5.centos if it
matters.

Thanks,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] heartbeat gets into weird state

2008-10-16 Thread Greg Woods
I've been using heartbeat for years, since the 1.0 days, and I've never
seen anything quite like this before. I'm running
heartbeat-2.1.3-3.el5.centos (RPM from the CentOS standard repository)
on an x86_64 machine running (obviously) CentOS 5. I'm not using the v2
features though, it's a standard v1 configuration. I have a shared
partition with DRBD. It is a dual-homed machine and both sides have a
heartbeat-managed service address. The system runs a freeradius server
(which only listens on one of the shared addresses because otherwise we
run into problems with the radius responses coming from a different IP
address than the client sent them to, which doesn't work) and some local
daemons that are started out of xinetd (both under heartbeat control).
In practice, up until yesterday afternoon, this has worked very well,
with failovers taking only a few seconds and everything coming up
properly.

Yesterday, we started getting calls that radius was not working. I tried
it and it worked fine. It took a while to figure out, but it turns out
that radius was working, but only for clients on the subnet directly
connected to the service address. The same was true of pings; I could
ping the service address only from the directly-connected subnet. So
this is not a radius issue. Sounds like a lost default route, right?
Wrong. The routing table looked fine. And, even weirder, I could ping
the local address of the same interface from off net. I could ping
www.google.com from the affected host. Only the service address was not
reachable from off net, but it worked fine for hosts on the local
subnet. 

I screwed around with this for a bit while the users continued to pound
on our customer service people, and finally decided: to hell with it,
let's just fail over to the other machine and get things working again.
So I did a "service heartbeat stop" to cause a failover, and it hung on
the dreaded:

WARN: Shutdown delayed until current resource activity finishes

This basically hung forever until I hit the power button, at which point
the other machine took over and all has been well since.

But obviously I need to find out what happened here. Has anyone else
ever seen anything like this, where the service address only works on
the directly-connected subnet whereas the home address works from
anywhere?

I've also investigated the warning message, and all I see are people
asking about this and getting no answer, or being told it's a known bug
and they need to upgrade heartbeat. Is that the case for me too?

Thanks,
--Greg


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems