Re: [Linux-HA] odd cluster failure

2017-02-09 Thread Greg Woods
On Thu, Feb 9, 2017 at 2:28 AM, Ferenc Wágner wrote: > Looks like your VM resource was destroyed (maybe due to the xen balloon errors above), and the monitor operation noticed this. Thank you for helping me interpret that. I think what happened is that the VM in question

[Linux-HA] odd cluster failure

2017-02-03 Thread Greg Woods
For the second time in a few weeks, we have had one node of a particular cluster getting fenced. It isn't totally clear why this is happening. On the surviving node I see: Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list Feb 2 16:48:52

[Linux-HA] Corosync 1 - 2

2014-10-01 Thread Greg Woods
I notice that the network:ha-clustering:Stable repo for CentOS 6 now contains Corosync 2.3.3-1 . I am currently running 1.4.1-17 . Is it safe to just run this update? Are there configuration changes I have to make in order for the new version to work? (If there is a document or wiki page

Re: [Linux-HA] Corosync 1 - 2

2014-10-01 Thread Greg Woods
On Wed, Oct 1, 2014 at 8:44 AM, Digimer li...@alteeve.ca wrote: Personally, I would not upgrade. If you do, you will want to test outside of production first. Of course, I would always do that anyway, even without a major version number change. Corosync needed cman to be a quorum

Re: [Linux-HA] Corosync 1 - 2

2014-10-01 Thread Greg Woods
On Wed, Oct 1, 2014 at 2:04 PM, Digimer li...@alteeve.ca wrote: Who runs the repo? It's not a name I am familiar with. It comes from opensuse.org . I'm pretty sure I got it out of one of the documents on the clusterlabs site, but I would have to go back and verify that to be certain. --Greg

Re: [Linux-HA] Multiple colocation with same resource group

2014-02-21 Thread Greg Woods
On Fri, 2014-02-21 at 12:37 +, Tony Stocker wrote: colocation inf_ftpd inf: infra_group ftpd or do I need to use an 'order' statement instead, i.e.: order ftp_infra mandatory: infra_group:start ftpd I'm far from a leading expert on this, but in my experience,
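A colocation constraint only pins placement; it does not sequence startup, so a group that must be up before ftpd generally needs both constraints. A sketch in the thread's own crm syntax (resource names taken from the question; the first resource in a colocation is placed relative to the second):

```
# crm fragment (sketch): keep ftpd with infra_group AND start infra_group first
colocation ftpd-with-infra inf: ftpd infra_group
order infra-before-ftpd mandatory: infra_group ftpd
```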

Re: [Linux-HA] drbd disks in secondary/secondary diskless/diskless mode

2013-08-14 Thread Greg Woods
On 08/14/2013 02:12 PM, Fredrik Hudner wrote: I have tried to make one node primary but only get: 0: State change failed: (-2) Need access to UpToDate data Command 'drbdsetup primary 0' terminated with exit code 17. When you've suffered a sudden disconnect, you can get into a situation where

[Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
I have two clusters that are both running CentOS 5.6 and heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the first one and pacemaker-1.0.12-1.el5 on the other). They both have identical ha.cf files except

Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
I know it's tacky to reply to myself, but I can answer one of my questions after another 15 minutes or so of poring through logs: On Tue, 2013-05-28 at 10:37 -0600, Greg Woods wrote: The questions are what do these messages actually mean, why is one cluster logging them and not the other

Re: [Linux-HA] heartbeat 'ERROR' messages

2013-05-28 Thread Greg Woods
On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote: respawn hacluster /usr/lib64/heartbeat/ipfail crm respawn I don't know about the rest, but definitely do not use both ipfail and crm. Pick one :) I guess I will have to look into what ipfail really does. I have a half dozen

Re: [Linux-HA] Antw: Re: vm live migration without shared storage

2013-05-24 Thread Greg Woods
On Fri, 2013-05-24 at 10:45 +0200, Ulrich Windl wrote: You are still mixing total migration time (which may be minutes) with virtual stand-still time (which is a few seconds). Correct. It was not clear (to me) that when the time to migrate was several minutes, the actual service outage was

Re: [Linux-HA] vm live migration without shared storage

2013-05-23 Thread Greg Woods
On Thu, 2013-05-23 at 15:00 -0400, David Vossel wrote: Migration time, depending on network speed and hardware, is much longer than the shared storage option (minutes vs. seconds). This is just one data point (of course), but for the vast majority of services that I run, if the live

Re: [Linux-HA] Antw: DRBD NetworkFailure

2013-04-24 Thread Greg Woods
On Wed, 2013-04-24 at 08:48 +0200, Ulrich Windl wrote: Greg Woods wo...@ucar.edu schrieb am 23.04.2013 um 21:20 in Nachricht Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed Apr 19 17:02:22 vmn2 kernel: block

Re: [Linux-HA] drbd error message decoding help

2013-04-24 Thread Greg Woods
On Wed, 2013-04-24 at 12:11 +0200, Lars Ellenberg wrote: drbd[25887]:2013/04/19_17:02:07 DEBUG: vmgroup2: Calling /usr/sbin/crm_master -Q -l reboot -v 1 I apologize for the noise about this. Further checks of the logs on all my clusters show that this is normal behavior. I started

Re: [Linux-HA] clean shutdown procedure?

2013-04-23 Thread Greg Woods
On Mon, 2013-04-22 at 09:50 -0600, Greg Woods wrote: On Mon, 2013-04-22 at 10:12 +1000, Andrew Beekhof wrote: On Saturday, April 20, 2013, Greg Woods wrote: Often one of the nodes gets stuck at Stopping HA Services That means pacemaker is waiting for one of your resources to stop

[Linux-HA] DRBD NetworkFailure

2013-04-23 Thread Greg Woods
Here's a new issue. We have had two outages, about 3 weeks apart, on one of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this was logged: Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in time. Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary - Unknown )

Re: [Linux-HA] clean shutdown procedure?

2013-04-19 Thread Greg Woods
On Fri, 2013-04-19 at 16:43 +0200, Florian Crouzat wrote: crm configure property OK, thanks for the suggestions. What is the difference between maintenance-mode=true and stop-all-resources=true? I tried the latter first, and all the resources do stop, except that all the stonith resources are

[Linux-HA] drbd error message decoding help

2013-04-19 Thread Greg Woods
I realize that nobody can solve a problem based on a single log entry, but I am trying to understand what happened with a cluster problem today. A similar thing happened with this cluster about 3 weeks ago, so this is one of those hard-to-solve intermittent issues. But it might help me now if I

Re: [Linux-HA] Heartbeat IPv6addr OCF

2013-03-24 Thread Greg Woods
On Sun, 2013-03-24 at 01:36 -0700, tubaguy50035 wrote: params ipv6addr=2600:3c00::0034:c007 nic=eth0:3 \ Are you sure that's a valid IPV6 address? I get headaches every time I look at these, but it seems a valid address is 8 groups, and you've got 5 there. Maybe you mean
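For what it's worth, `::` compression means fewer than eight visible groups can still form a valid IPv6 address, so the posted address may be legal after all. A quick way to check from the shell, using Python's stdlib (the address below is the one from the post):

```shell
# '::' expands to as many all-zero groups as needed, so the 5 visible
# groups in the posted address are a legal abbreviation of 8 groups.
python3 -c '
import ipaddress
addr = ipaddress.ip_address("2600:3c00::0034:c007")
print(addr.exploded)
'
```

This prints the full 8-group form, confirming the abbreviation expands cleanly.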

Re: [Linux-HA] Using a Ping Daemon (or Something Better) to Prevent Split Brain

2013-01-31 Thread Greg Woods
On Thu, 2013-01-31 at 02:09 +, Robinson, Eric wrote: the secondary should wait for a manual command to become primary. That can be accomplished with the meatware STONITH device. Requires a command to be run to tell the wannabe primary that the secondary is really dead (and, of course, you
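A meatware setup along those lines might look like this in the crm shell (node names are placeholders; check `crm ra info stonith:meatware` for the exact parameters on your version):

```
# crm configure fragment (sketch, hypothetical node names)
primitive st-meat stonith:meatware \
    params hostlist="nodeA nodeB"
clone cl-st-meat st-meat
```

When a node is flagged for fencing, an operator confirms it is really down with `meatclient -c nodeA` on the survivor; the cluster then proceeds with takeover.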

Re: [Linux-HA] how to diagnose stonith death match?

2013-01-09 Thread Greg Woods
On Thu, 2013-01-10 at 08:35 +1100, Andrew Beekhof wrote: On Wed, Jan 9, 2013 at 4:16 PM, Greg Woods wo...@ucar.edu wrote: I got the cluster running with xend by moving the heartbeat to a different interface. Having heartbeat start after the bridge is created _should_ also work

Re: [Linux-HA] how to diagnose stonith death match?

2013-01-08 Thread Greg Woods
On Tue, 2013-01-08 at 09:18 +1100, Andrew Beekhof wrote: On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote: The problem is that either node can come up and run all the resources, but as soon as I bring the other node online, it briefly looks normal, but as soon as the stonith resource

Re: [Linux-HA] how to diagnose stonith death match?

2013-01-08 Thread Greg Woods
On Wed, 2013-01-09 at 13:15 +1100, Andrew Beekhof wrote: IIRC, part of the activation involves tearing down the normal interface and creating the bridge. At this point the device heartbeat was talking to is gone. I hadn't thought of that, because afterwards, ethX looks exactly the same as it

Re: [Linux-HA] how to diagnose stonith death match?

2013-01-03 Thread Greg Woods
On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote: The problem is that either node can come up and run all the resources, but as soon as I bring the other node online, it briefly looks normal, but as soon as the stonith resource starts, the currently running node gets fenced and the new

Re: [Linux-HA] Some novice questions?

2013-01-02 Thread Greg Woods
On Tue, 2013-01-01 at 14:58 +0330, Ali Masoudi wrote: Is it mandatory to use same ha.cf on both nodes? I don't think it is absolutely mandatory, but it is best practice. Unless you really know what you are doing, you can run into difficulties getting heartbeat to work properly if the ha.cf

Re: [Linux-HA] Some novice questions?

2012-12-31 Thread Greg Woods
On Mon, 2012-12-31 at 15:09 +0330, Ali Masoudi wrote: ucast eth3 192.168.50.17 If you are using ucast, then you need one line for each node's IP in the ha.cf file. Either that or different ha.cf files on each node. What is needed is the IP of the other node, but heartbeat is smart enough to
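For example, with two nodes at 192.168.50.16 and 192.168.50.17 (addresses assumed for illustration), a single shared ha.cf can carry both lines, since heartbeat skips the entry matching a local address:

```
# ha.cf fragment (sketch) -- identical file on both nodes
ucast eth3 192.168.50.16
ucast eth3 192.168.50.17
```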

[Linux-HA] how to diagnose stonith death match?

2012-12-28 Thread Greg Woods
I did some reconfiguration of the NICs and IP addresses on my 2-node test cluster (running heartbeat and Pacemaker on CentOS 5, slightly old versions but they have been working fine up to now on this and several other clusters). I am sure that the NIC configuration is correct and that the CIB has

Re: [Linux-HA] Custom resource agent script assistance

2011-12-01 Thread Greg Woods
On Thu, 2011-12-01 at 13:25 -0400, Chris Bowlby wrote: Hi Everyone, I'm in the process of configuring a 2 node + DRBD enabled DHCP cluster This doesn't really address your specific question, but I got dhcpd to work by using the ocf:heartbeat:anything RA. primitive dhcp

Re: [Linux-HA] Monitoring only across WAN

2011-06-20 Thread Greg Woods
On Mon, 2011-06-20 at 17:47 +0800, Emmanuel Noobadmin wrote: The objective is to achieve sub minute monitoring of services like httpd and exim/dovecot so that I can run a script to notify/SMS myself when one of the machines fails to respond. Right now I'm just running a cron script every few

Re: [Linux-HA] cat /dev/ttyS0

2011-05-23 Thread Greg Woods
On Mon, 2011-05-23 at 13:59 -0700, Hai Tao wrote: this might not be too close to HA, but I am not sure if someone has seen this before: I use a serial cable between two nodes, and I am testing the heartbeat with : server2$ cat /dev/ttyS0 server1$ echo hello > /dev/ttyS0 instead

Re: [Linux-HA] Does heartbeat only use ping to check health of other server?

2011-04-04 Thread Greg Woods
On Mon, 2011-04-04 at 11:44 -0500, Neil Aggarwal wrote: From what I can figure out from the ha.cf file, heartbeat uses ping to tell if the peer is up. Not really. It uses special heartbeat packets to tell if the peer is up. Ping is used to tell the difference between a dead peer and a bad NIC
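The distinction shows up directly in ha.cf: the node list names peers monitored via heartbeat packets, while `ping` entries name outside witnesses used only to judge local connectivity. A minimal sketch (addresses and hostnames assumed):

```
# ha.cf fragment (sketch)
node    server1 server2      # peers, tracked with heartbeat packets
ping    192.168.9.1          # e.g. the default gateway, a connectivity witness
respawn hacluster /usr/lib64/heartbeat/ipfail   # reacts to lost ping targets
```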

Re: [Linux-HA] Does heartbeat only use ping to check health of other server?

2011-04-04 Thread Greg Woods
On Mon, 2011-04-04 at 13:38 -0500, Neil Aggarwal wrote: crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=192.168.9.101 cidr_netmask=32 \ op monitor interval=30s Does that mean heartbeat is being used to detect when to move the IP address to the standby server?

Re: [Linux-HA] problem with DRBD-based resource

2010-12-29 Thread Greg Woods
On Wed, 2010-12-29 at 12:56 +0100, Dejan Muhamedagic wrote: Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op: Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090 op=vmgroup1:0_stop_0 ) Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:

Re: [Linux-HA] problem with DRBD-based resource

2010-12-29 Thread Greg Woods
On Tue, Dec 28, 2010 at 03:18:06PM -0700, Greg Woods wrote: I updated one of my clusters today, and among other things, I updated from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related or not. Turns out it is. I downgraded the idle node to 1.0.9 and started heartbeat

[Linux-HA] problem with DRBD-based resource

2010-12-28 Thread Greg Woods
I updated one of my clusters today, and among other things, I updated from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related or not. The problem is that I cannot get the cluster to come up clean. Right now all resources are running on one node and it is OK that way. As soon as I

Re: [Linux-HA] strange crm behavior

2010-12-21 Thread Greg Woods
On Tue, 2010-12-21 at 12:09 +0100, Dejan Muhamedagic wrote: Could it be that the status shown below is part of a node status which is not in the cluster any more? Or a node which is down? No, that is not possible. This is a two-node cluster and both nodes have been up for many days and are

Re: [Linux-HA] strange crm behavior

2010-12-20 Thread Greg Woods
On Mon, 2010-12-20 at 12:40 +0100, Dejan Muhamedagic wrote: That's strange. resource cleanup should definitely remove the LRM (status) part. Can you please try again and then do: # cibadmin -Q | grep VM-paranfsvm It seems like it is not removing status info for old removed resources:

Re: [Linux-HA] Multiple stonith and Heartbeat 2.1.4

2010-11-18 Thread Greg Woods
On Thu, 2010-11-18 at 14:46 +0100, Sébastien Prud'homme wrote: I'm using meatware as a second stonith resource I'm doing this and it works fine. Unfortunately after several tests, i didn't find a way to make it work: only the first stonith ressource is used (and fails), the cluster enter

Re: [Linux-HA] debugging resource configuration

2010-11-03 Thread Greg Woods
On Wed, 2010-11-03 at 11:13 +0100, Dejan Muhamedagic wrote: ERROR with rpm_check_debug vs depsolve: heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64 Complete! (1, [u'Please report this error in https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise

Re: [Linux-HA] debugging resource configuration

2010-11-02 Thread Greg Woods
On Tue, 2010-11-02 at 11:11 +0100, Dejan Muhamedagic wrote: If you're using resource-agents, the package should be named ldirectord not heartbeat-ldirectord. The two packages should also have the same release numbers, probably something like 1.0.3-x. I figured as much. But there appears to be

Re: [Linux-HA] debugging resource configuration

2010-11-02 Thread Greg Woods
On Tue, 2010-11-02 at 22:24 +0100, Lars Ellenberg wrote: ldirectord package PROVIDES heartbeat-ldirectord and CONFLICTS with heartbeat-ldirectord. The ldirectord package's spec has a self-conflict. This is a patch for the problem. --- resource-agents.spec +++ resource-agents.spec

Re: [Linux-HA] debugging resource configuration

2010-10-29 Thread Greg Woods
On Thu, 2010-10-28 at 18:38 -0600, Eric Schoeller wrote: Just a shot in the dark here kind of ... but I know that when I had this type of problem with a stonith device it was timeout related. You could try boosting your timeouts all around, or even check what # time /usr/sbin/ldirectord

Re: [Linux-HA] ldirectord on CentOS 5

2010-10-29 Thread Greg Woods
On Fri, 2010-10-29 at 12:09 +0900, Masashi Yamaguchi wrote: I think ldirectord rpm package's spec for RedHat/CentOS is inconsistent. $ rpm -qp --provides ldirectord-1.0.3-2.el5.x86_64.rpm config(ldirectord) = 1.0.3-2.el5 heartbeat-ldirectord ldirectord = 1.0.3-2.el5 $ rpm -qp

[Linux-HA] ldirectord on CentOS 5

2010-10-28 Thread Greg Woods
I currently have an old heartbeat v1 cluster that I am moving to a newer Pacemaker/heartbeat v3 cluster. That is, I am moving the functionality of the old cluster to the new one so that the old one can be phased out. The new cluster is running all the latest stuff from the clusterlabs repo under

Re: [Linux-HA] ldirectord on CentOS 5

2010-10-28 Thread Greg Woods
The same thing happens if I disable the extras repo, and even if I do yum clean all first. If instead I try to install heartbeat-ldirectord and disable the clusterlabs repo (which might result in a package that doesn't work right in any event), I get a different error: Transaction

Re: [Linux-HA] ldirectord on CentOS 5

2010-10-28 Thread Greg Woods
On Thu, 2010-10-28 at 14:52 -0600, Greg Woods wrote: I am a little confused. I was actually more confused than I thought. When I got this error: Failed actions: ldirectord_monitor_0 (node=vmx2.ucar.edu, call=137, rc=5, status=complete): not installed ldirectord_monitor_0 (node

[Linux-HA] debugging resource configuration

2010-10-28 Thread Greg Woods
This is a continuation of trying to get ldirectord working under pacemaker. I have a working installation of ldirectord. I know this because if I manually configure the eth0:0 pseudo-interface with the virtual server address, and manually start ldirectord with # /usr/sbin/ldirectord

Re: [Linux-HA] heartbeat with postgresql

2010-10-22 Thread Greg Woods
On Fri, 2010-10-22 at 18:32 +0200, Andrew Beekhof wrote: if you're just using v1 - thats not a cluster, thats a prayer. Then God must answer my prayers, because I have been using some simple heartbeat v1/DRBD clusters for YEARS, for critical services like DNS. They have worked flawlessly and

Re: [Linux-HA] heartbeat with postgresql

2010-10-20 Thread Greg Woods
On Wed, 2010-10-20 at 08:13 +0200, Andrew Beekhof wrote: Um, maybe because heartbeat v1 has a much much much much less steep learning curve? I dispute that: http://theclusterguy.clusterlabs.org/post/178680309/configuring-heartbeat-v1-was-so-simple This addresses the fact that

Re: [Linux-HA] heartbeat with postgresql

2010-10-19 Thread Greg Woods
On Tue, 2010-10-19 at 10:01 -0600, Serge Dubrouski wrote: Any particular reason for using Heartbeat v1 instead of CRM/Pacemaker? Um, maybe because heartbeat v1 has a much much much much less steep learning curve? If you have a simple two-node cluster where one node is just a hot spare, it is way

Re: [Linux-HA] Standby Node Refuses to Take Over

2010-09-27 Thread Greg Woods
On Mon, 2010-09-27 at 09:43 -0700, Robinson, Eric wrote: I went so far as to turn off the primary, but the standby still never took over. Do you have STONITH configured? I have run into this too. The primary will not take over unless it is told somehow that the secondary is really and

Re: [Linux-HA] Standby Node Refuses to Take Over

2010-09-27 Thread Greg Woods
On Mon, 2010-09-27 at 12:16 -0700, Robinson, Eric wrote: Not sure if you noticed in my previous message that I did physically power down the primary but the standby refused to take any action. Yes, I did notice that. My point is that I have noted on my clusters that simply powering it down

Re: [Linux-HA] node standby attribute and crm (SOLVED Partially)

2010-09-24 Thread Greg Woods
On Fri, 2010-09-24 at 11:34 -0600, Greg Woods wrote: # crm node show vmserve2.scd.ucar.edu(16fde08d-b4b6-4550-adfb-b3aab83f706f): normal standby: off vmserve.scd.ucar.edu(6f5ced83-a790-4519-8449-3d4cf43275b0): normal standby: off On the second cluster: # crm node show

Re: [Linux-HA] Adding DHCPD and NAMED as resources

2010-09-09 Thread Greg Woods
On Thu, 2010-09-09 at 16:35 +0100, Daniel Machado Grilo wrote: Another way to do this is if you choose LSB instead of OCF category primitives. That way you just select the init script from your init.d and that's it. You do need to ensure that your init script is LSB compliant. This includes
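LSB compliance matters mostly because the cluster probes resources with the script's `status` action and trusts the exit code: 0 must mean running and 3 stopped. A toy stand-in illustrating that contract (not a real init script):

```shell
# Toy stand-in for "/etc/init.d/foo status" exit-code semantics.
# LSB: 0 = running, 3 = not running; other values are treated as errors,
# which can make the cluster think the resource failed.
status_of() {
    if [ "$1" = "up" ]; then
        return 0    # service running
    else
        return 3    # service stopped
    fi
}

status_of up;   echo "running -> $?"
status_of down; echo "stopped -> $?"
```

A script that returns 0 from `status` even when the service is stopped will make probes report the resource as already active everywhere.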

Re: [Linux-HA] Adding DHCPD and NAMED as resources

2010-09-08 Thread Greg Woods
On Wed, 2010-09-08 at 14:18 -0500, Bradley Leduc wrote: Am trying to add NAMED and DHCPD services as a resource on heartbeat-3.0.1-1.el5 cluster with no luck, I was wondering if anyone would know of an easy to do this. Any help would be great. Are you running pacemaker or just a heartbeat

Re: [Linux-HA] problem with static routes

2010-08-23 Thread Greg Woods
On Sun, 2010-08-22 at 10:25 -0600, Greg Woods wrote: The basic problem is that when I reboot a node in my cluster, it comes back up without its static routes. I have determined through experimentation that it is the setup/teardown of Xen networking that is causing this. The static routes also

[Linux-HA] problem with static routes

2010-08-22 Thread Greg Woods
OS: CentOS 5.5 heartbeat: heartbeat-3.0.3-2.3.el5 (latest from clusterlabs) pacemaker: pacemaker-1.0.9.1-1.15.el5 (latest from clusterlabs) If it matters, this cluster is primarily used to run Xen virtual machines (xen-3.0.3-105.el5_5.5 kernel-2.6.18-194.11.1.el5xen latest from CentOS) I have

Re: [Linux-HA] Question about grouping with a clone inside group ?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 17:09 +0200, Alain.Moulle wrote: crm configure colocation coloc1 +INFINITY:group1 clone-fs1 This says that group1 and clone-fs1 have to be on the same machine. That prohibits starting clone-fs1 on a machine where group1 is not running. That isn't what you meant. I

Re: [Linux-HA] time to fork heartbeat?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 17:52 -0400, Peter Sylvester wrote: I do have to agree. I've actually been working for almost 4 business days now on trying to get Heartbeat and Pacemaker working together It took me six months to build a decent cluster, starting as one who was very experienced with

Re: [Linux-HA] Am I even on the right track here with Heartbeat?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 17:13 -0500, Dimitri Maziuk wrote: So is it not practical to run RHEL or CentOS 5.x where you'd get this version and several more years of disto maintenance? It's not practical if you want to have both distro maintenance and cluster support. I run CentOS 5.5, and

Re: [Linux-HA] time to fork heartbeat?

2010-08-11 Thread Greg Woods
On Thu, 2010-08-12 at 00:38 +0200, Dejan Muhamedagic wrote: On Wed, Aug 11, 2010 at 09:53:01PM -, Yan Seiner wrote: Heck, it really should just take two things: 1. IP of remote computer 2. Device to use Device? Bang, it just works. For many of us this would be

Re: [Linux-HA] Am I even on the right track here with Heartbeat?

2010-08-11 Thread Greg Woods
On Wed, 2010-08-11 at 20:01 -0500, Dimitri Maziuk wrote: 1) there are installations where throwing in a package from 3rd party repo will cost you a lot. Like tech. support on a very very expensive piece of hardware. (Think giant hadron collider type of hardware.) Sure, there are some

Re: [Linux-HA] Waiting for confirmation before failover on the backup server

2010-07-23 Thread Greg Woods
On Fri, 2010-07-23 at 06:24 -0700, Mahadevan Iyer wrote: When using only heartbeat (no pacemaker), is there a way to do the following: set up a backup server such that when it tries to take over due to loss of connectivity with the main server, it waits for confirmation from an operator? This

Re: [Linux-HA] Manual intervention on failover

2010-07-15 Thread Greg Woods
On Thu, 2010-07-15 at 07:36 -0700, Pushkar Pradhan wrote: Hi, I have a strange requirement: I don't want failover to happen unless an operator says go ahead or a big timeout has occurred (e.g. 1 hour). I am using Heartbeat R1 style cluster with 2 nodes. Is this possible or do I need to write

Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-28 Thread Greg Woods
On Mon, 2010-06-28 at 10:47 +0200, Dejan Muhamedagic wrote: (drbd_xen2:1:probe:stderr) DRBD module version: 8.3.8, userland version: 8.3.6 you should upgrade your drbd tools! I guess that you should follow this advice. Just one data point: I get this message in my logs too, but DRBD

Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-27 Thread Greg Woods
On Sun, 2010-06-27 at 03:02 -0700, Joe Shang wrote: Failed actions: drbd_xen2:1_start_0 (node=xen1.box.com, call=10, rc=5, status=complete): not installed This is one of the things that I don't like about heartbeat/pacemaker. A minor error (misconfiguring a single resource) can cause

Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-27 Thread Greg Woods
On Sun, 2010-06-27 at 07:57 -0700, Joe Shang wrote: Jun 27 10:51:49 xen1 lrmd: [3949]: info: RA output: (drbd_xen2:1:probe:stderr) 'xen2' not defined in your config. This looks like an error in your DRBD configuration. What is in drbd.conf? What does drbd-overview or drbdadm state all show?

Re: [Linux-HA] 3 node cluster keeps failing after domU image is started

2010-06-27 Thread Greg Woods
You could try making one of them primary: # drbdadm primary xen1 If that doesn't work, you may have encountered a split brain situation. In that case, you have to tell DRBD that it is OK for one of the machines to discard the data it has so that the other one can become primary. Look here:
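If it does turn out to be split brain, the usual manual recovery is to pick a victim node and discard its changes. A sketch of the sequence in the DRBD 8.3-era syntax (resource name `xen1` from the thread; the first two commands run on the node whose data you are discarding):

```
# on the node that will DISCARD its changes (sketch)
drbdadm secondary xen1
drbdadm -- --discard-my-data connect xen1
# on the surviving node, if it also dropped the connection
drbdadm connect xen1
```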

Re: [Linux-HA] stonith external/rackpdu question

2010-05-20 Thread Greg Woods
On Thu, 2010-05-20 at 18:30 +0100, Alexander Fisher wrote: I think I'll use IPMI and rackpdu in the same configuration. That is exactly what I will eventually try (assuming I ever get any time to work on my test cluster some more). It is clear that, no matter what I do, I cannot prepare for

Re: [Linux-HA] stonith external/rackpdu question

2010-05-18 Thread Greg Woods
Do you know that on APC PDUs, you can group outlets across several physical PDUs? I've got a bit more testing to do, but this seems to work ok. The plugin is configured to talk to just one outlet on one of the PDUs and the PDU does the rest. No, I didn't know you could do this. I will have

Re: [Linux-HA] stonith external/rackpdu question

2010-05-05 Thread Greg Woods
On Wed, 2010-05-05 at 13:29 +0200, Dejan Muhamedagic wrote: If these servers have a lights-out device and the power distribution is fairly reliable, that could be an alternative for fencing. They do have an IPMI device and it does work. I am trying to insulate against a failure of the NIC or

[Linux-HA] stonith external/rackpdu question

2010-05-04 Thread Greg Woods
We have a pair of servers in a cluster plugged into a pair of APC rack-mounted PDU's of the sort that could be controlled by this stonith plugin. My problem is that these are dual power supply servers, which means I would have to shut down two outlets that are on two different PDUs to

Re: [Linux-HA] Heartbeat doesn't see other nodes in cluster

2010-04-15 Thread Greg Woods
On Wed, 2010-04-14 at 16:24 -0700, Stephen Punak wrote: Heartbeat appears to start just fine on all nodes, but none of them see each other. Any chance there is a firewall blocking the heartbeat packets? You'd still see them with wireshark, but they would be blocked from getting to the

Re: [Linux-HA] manual fencing

2010-04-11 Thread Greg Woods
On Thu, 2010-04-08 at 17:46 +0200, Dejan Muhamedagic wrote: Does this help? $ crm ra info stonith:meatware Yes, it does! Thank you! --Greg

Re: [Linux-HA] trouble with CRM/XEN

2010-04-07 Thread Greg Woods
On Wed, 2010-04-07 at 15:39 +0200, Andrew Beekhof wrote: I increased the timeout even further (to 120s instead of the minimum recommended 60) and it seems to be working. Curious though, because when it does work, the logs show that the entire stop operation, including a live migration,

Re: [Linux-HA] manual fencing

2010-04-07 Thread Greg Woods
On Wed, 2010-04-07 at 12:13 +0200, Dejan Muhamedagic wrote: There's not much magic, you just configure two stonith resources and assign different priority, then they'll be tried in a round-robin fashion. For instance: primitive st-real stonith:ipmilan \ params ... priority=10 # it
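Filled out a little, the round-robin pair from that advice might look like this (hostnames, addresses, and credentials are placeholders; the priority placement follows the form quoted above, so verify it against `crm ra info` for your stonith plugins):

```
# crm fragment (sketch): real fencing device preferred, meatware as fallback
primitive st-real stonith:ipmilan \
    params hostname=node2 ipaddr=10.0.0.2 login=admin password=secret \
    priority=10
primitive st-backup stonith:meatware \
    params hostlist="node2" \
    priority=1
```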

Re: [Linux-HA] order directives and pacemaker update

2010-04-06 Thread Greg Woods
On Tue, 2010-04-06 at 12:29 +0200, Dejan Muhamedagic wrote: There's a crm shell bug in 1.0.8-2 in the validation process. Either revert to the earlier pacemaker or apply this patch: http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/422fed9d8776 OK, that's a relief. I chose to apply the

[Linux-HA] manual fencing

2010-04-06 Thread Greg Woods
I'm looking for a good way to deal with the total power drop case. I am using an iDrac 6 as a stonith device on a pair of Dell R710 servers. I tried the power drop test today by simply unplugging the power on one of the nodes. What happens in this case is that the attempt by the other node to

Re: [Linux-HA] manual fencing

2010-04-06 Thread Greg Woods
On Tue, 2010-04-06 at 14:58 -0700, Tony Gan wrote: I think the solution is using UPS or PDU for STONITH device. That could improve things in some scenarios, but it does not completely solve the problem. The cluster is still vulnerable to having the entire power strip for one node unplugged or

Re: [Linux-HA] trouble with stonith [SOLVED]

2010-04-05 Thread Greg Woods
On Sat, 2010-04-03 at 22:55 +0200, Dejan Muhamedagic wrote: That should've probably caused a connection timeout and this message: IPMI operation timed out... :( Was there such a message in the log? Now that I know to look for it, yes. So far I am having a great deal of difficulty sifting

Re: [Linux-HA] trouble with CRM/XEN

2010-04-05 Thread Greg Woods
On Sat, 2010-04-03 at 22:45 +0200, Dejan Muhamedagic wrote: I spoke too soon; now I am getting failures when stopping the Xen resources manually as well. I can't get both nodes online at the same time unless I disable stonith. There should be something in the logs. grep for lrmd and the

[Linux-HA] order directives and pacemaker update

2010-04-05 Thread Greg Woods
Since I applied the most recent Pacemaker update last Friday (now running pacemaker-1.0.8-2.el5.x86_64 on CentOS 5), I can no longer enter order directives. I am using the exact same syntax that I used previously, and the syntax matches some existing directives, but the crm shell won't take it.

Re: [Linux-HA] trouble with stonith [SOLVED]

2010-04-02 Thread Greg Woods
On Thu, 2010-04-01 at 15:38 -0600, Greg Woods wrote: node1# stonith -t ipmilan -F node2-param-file node2 This works both ways; the remote node reboots. So I should be able to rule out DRAC configuration issues. I have also checked, double-checked, and triple checked that the parameters

[Linux-HA] trouble with CRM/XEN

2010-04-02 Thread Greg Woods
I am having difficulty achieving a clean failover in a Pacemaker 1.0.7 cluster that is mainly there to run Xen virtual machines. I realize that nobody can tell me exactly what is wrong without seeing an awful lot of configuration detail; what I am looking for is more like some general methods I

[Linux-HA] where is CRM 'default timeout' set?

2010-04-02 Thread Greg Woods
In the process of trying to fix two other problems, I messed something up badly. Now when I go into the crm shell to edit the configuration, on verify I get a message like this for every one of my configured resources: WARNING: vm-ip1: default timeout 20s for start is smaller than the advised 90

Re: [Linux-HA] trouble with CRM/XEN

2010-04-02 Thread Greg Woods
On Fri, 2010-04-02 at 13:02 -0600, Greg Woods wrote: In a nutshell: if I manually stop all the Xen resources first with a command like crm resource stop vmname), then failover works perfectly, I spoke too soon; now I am getting failures when stopping the Xen resources manually as well. I can't

Re: [Linux-HA] where is CRM 'default timeout' set? [SOLVED]

2010-04-02 Thread Greg Woods
On Fri, 2010-04-02 at 13:53 -0600, Greg Woods wrote: WARNING: vm-ip1: default timeout 20s for start is smaller than the advised 90 Found the answer for this one in the gossamer-threads for the pacemaker list. Should have thought of looking there first. For those who are struggling

[Linux-HA] trouble with stonith

2010-04-01 Thread Greg Woods
I'm trying to get stonith to work on a two-node cluster using Dell iDrac. If I run stonith manually with a command like: node1# stonith -t ipmilan -F node2-param-file node2 This works both ways; the remote node reboots. So I should be able to rule out DRAC configuration issues. I have also

Re: [Linux-HA] NFS and DRBD

2010-03-23 Thread Greg Woods
On one node, I can get all services to start (and they work fine), but whenever failover occurs, there are NFS-related handles left open, inhibiting or hanging the failover; more specifically, the file systems fail to unmount. If you are referring to file systems on the server that are
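One common cause of the unmount hang described above is stop ordering: the NFS server must release its handles before the filesystem can unmount. A minimal sketch in crm shell syntax, assuming a resource group; all names are hypothetical:

```
# Members of a crm group start left-to-right and stop in reverse order,
# so listing nfsserver after the Filesystem resource ensures NFS is
# stopped (handles released) before the unmount is attempted.
group nfs_group drbd_fs nfsserver nfs_ip
```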

Re: [Linux-HA] getting stonith working [SOLVED]

2010-03-09 Thread Greg Woods
On Mon, 2010-03-08 at 16:56 +0100, Dejan Muhamedagic wrote: Hi, On Fri, Mar 05, 2010 at 03:07:45PM -0700, Greg Woods wrote: Partially solved, anyway. Glad you got it solved, but why do you say partially? Because I managed to get it working without ever figuring out exactly what it was I

[Linux-HA] getting stonith working

2010-03-05 Thread Greg Woods
I am in the process of climbing the learning curve for Pacemaker. I'm using RPMs from clusterlabs on CentOS 5: heartbeat-3.0.2-2.el5 pacemaker-1.0.7-4.el5 It has been a long hard struggle, but I have mostly gotten my two-node cluster working. But I've hit a wall trying to get stonith to work. I

Re: [Linux-HA] getting stonith working [SOLVED]

2010-03-05 Thread Greg Woods
Partially solved, anyway. On Fri, 2010-03-05 at 12:52 -0700, Greg Woods wrote: crm(live)configure# primitive stonith-vm1 stonith:ipmilan params auth=straight hostname=node1.fqdn ipaddr=** login=* password=* priv=admin port=23 crm(live)configure# location nosuicide-vm1 stonith-vm1
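Cleaned up, the configuration quoted above reads roughly as follows. The starred credential values are elided in the original and left as-is; the -inf score on the location rule is an assumption based on the constraint's name ("nosuicide", i.e. keep a node's fencing device off the node it is meant to fence):

```
primitive stonith-vm1 stonith:ipmilan \
    params auth=straight hostname=node1.fqdn ipaddr=** \
        login=* password=* priv=admin port=23
location nosuicide-vm1 stonith-vm1 -inf: node1.fqdn
```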

Re: [Linux-HA] Backup of SVN repositories

2008-12-03 Thread Greg Woods
On Wed, 2008-12-03 at 21:23 +, Todd, Conor (PDX) wrote: I can't do this using a crontab because one never knows which host will be running the SVN service (and have the disks mounted for it). Has anyone else tackled this issue yet? You may not know in advance whether a given host is
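One common approach to the cron problem above is a wrapper installed on every node that does real work only where the service's filesystem is actually mounted. A minimal sketch; the mount point, repository path, and backup destination are all assumptions:

```shell
#!/bin/sh
# Hypothetical cron wrapper: back up the SVN repository only on the node
# that currently has the service's filesystem mounted.
SVN_MOUNT=/srv/svn
BACKUP_DIR=/backup/svn

# grep -qs: quiet, and treat a missing /proc/mounts as "not mounted".
if grep -qs " $SVN_MOUNT " /proc/mounts; then
    svnadmin hotcopy "$SVN_MOUNT/repos" "$BACKUP_DIR/repos.$(date +%F)"
fi
# On the passive node the mount test fails and the script exits 0
# silently, so cron sends no mail.
```

The same crontab entry can then be installed on both nodes; only the active one ever runs the backup.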

Re: [Linux-HA] HA of slpad (openLDAP) with Heartbeat and IPAddr.

2008-11-24 Thread Greg Woods
On Mon, 2008-11-24 at 18:47 +0530, Divyarajsinh Jadeja wrote: Hi, I am new to Heartbeat. How can we configure openLDAP with Heartbeat for high availability of authentication? I need to have slapd running on both machines because LDAP replication needs slapd on both nodes. I

Re: [Linux-HA] Running OpenLDAP and MySQL w/Linux HA

2008-11-20 Thread Greg Woods
On Wed, 2008-11-19 at 13:50 -0800, Rob Tanner wrote: The next thing to try is OpenLDAP and MySQL, both of which are critical services and both of which are far more complex. Is anyone running them on Linux HA? Does it work reliably when you switch, etc.? How do you have it all

Re: [Linux-HA] New user questions, config file locations and hb_gui

2008-10-23 Thread Greg Woods
On Thu, 2008-10-23 at 10:48 -0600, Landon Cox wrote: b) apache, postgresql, mysql and some custom services are always running on both machines to reduce startup times on failover You might want to carefully consider the tradeoff here. Getting two-way database replication to work reliably

Re: [Linux-HA] New user questions, config file locations and hb_gui

2008-10-23 Thread Greg Woods
On Thu, 2008-10-23 at 13:24 -0600, Landon Cox wrote: Do you really have an application where you can't even afford a few seconds of downtime at failover? No. Anything sub-60 seconds would be tolerated. In that case, I really think it will be easier to set up DRBD. That way you can
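For context on the DRBD alternative mentioned above: a single-primary DRBD device holds the MySQL data directory, the filesystem is mounted only on the active node, and failover moves the mount along with the service, avoiding two-way replication entirely. A minimal sketch of such a resource definition; node names, devices, and addresses are invented:

```
# /etc/drbd.d/mysql.res -- single-primary resource backing /var/lib/mysql
resource mysql {
    protocol C;                   # synchronous replication
    on node1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.100.1:7788;
        meta-disk internal;
    }
    on node2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.100.2:7788;
        meta-disk internal;
    }
}
```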

[Linux-HA] USB serial links?

2008-10-17 Thread Greg Woods
I am aware that heartbeat can be done over USB links using USB-Ethernet interfaces. I specifically do not want to do that, because I am looking for a heartbeat link that will be independent of the IP stack, but on machines that do not have native serial ports. So I got a couple of Keyspan

[Linux-HA] heartbeat gets into weird state

2008-10-16 Thread Greg Woods
I've been using heartbeat for years, since the 1.0 days, and I've never seen anything quite like this before. I'm running heartbeat-2.1.3-3.el5.centos (RPM from the CentOS standard repository) on an x86_64 machine running (obviously) CentOS 5. I'm not using the v2 features though, it's a standard