On Thu, Feb 9, 2017 at 2:28 AM, Ferenc Wágner wrote:
> Looks like your VM resource was destroyed (maybe due to the xen balloon
> errors above), and the monitor operation noticed this.
>
Thank you for helping me interpret that. I think what happened is that the
VM in question
For the second time in a few weeks, we have had one node of a particular
cluster getting fenced. It isn't totally clear why this is happening. On
the surviving node I see:
Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence
(reboot) vmc2.ucar.edu: static-list
Feb 2 16:48:52
I notice that the network:ha-clustering:Stable repo for CentOS 6 now
contains Corosync 2.3.3-1. I am currently running 1.4.1-17. Is it safe to
just run this update? Are there configuration changes I have to make in
order for the new version to work? (If there is a document or wiki page
On Wed, Oct 1, 2014 at 8:44 AM, Digimer li...@alteeve.ca wrote:
Personally, I would not upgrade. If you do, you will want to test outside
of production first.
Of course, I would always do that anyway, even without a major version
number change.
Corosync needed cman to be a quorum
On Wed, Oct 1, 2014 at 2:04 PM, Digimer li...@alteeve.ca wrote:
Who runs the repo? It's not a name I am familiar with.
It comes from opensuse.org. I'm pretty sure I got it out of one of the
documents on the clusterlabs site, but I would have to go back and verify
that to be certain.
--Greg
On Fri, 2014-02-21 at 12:37 +, Tony Stocker wrote:
colocation inf_ftpd inf: infra_group ftpd
or do I need to use an 'order' statement instead, i.e.:
order ftp_infra mandatory: infra_group:start ftpd
I'm far from a leading expert on this, but in my experience,
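For reference, the two constraints do different jobs: colocation keeps the
resources on the same node, while order controls the startup sequence, and
in a setup like this you usually want both. A minimal sketch in crm shell
syntax, reusing the resource names from the question (the constraint IDs
and scores are assumptions):
  colocation ftpd_with_infra inf: ftpd infra_group
  order infra_before_ftpd inf: infra_group ftpd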
On 08/14/2013 02:12 PM, Fredrik Hudner wrote:
I have tried to make one node primary but only get:
0: State change failed: (-2) Need access to UpToDate data
Command 'drbdsetup primary 0' terminated with exit code 17
When you've suffered a sudden disconnect, you can get into a situation
where
I have two clusters that are both running CentOS 5.6 and
heartbeat-3.0.3-2.3.el5 (from the clusterlabs repo). They are running
slightly different pacemaker versions (pacemaker-1.0.9.1-1.15.el5 on the
first one and pacemaker-1.0.12-1.el5 on the other). They both have
identical ha.cf files except
I know it's tacky to reply to myself, but I can answer one of my
questions after another 15 minutes or so of poring through logs:
On Tue, 2013-05-28 at 10:37 -0600, Greg Woods wrote:
The questions are what do these messages actually mean, why is one
cluster logging them and not the other
On Wed, 2013-05-29 at 07:50 +1000, Andrew Beekhof wrote:
respawn hacluster /usr/lib64/heartbeat/ipfail
crm respawn
I don't know about the rest, but definitely do not use both ipfail and crm.
Pick one :)
I guess I will have to look into what ipfail really does. I have a half
dozen
On Fri, 2013-05-24 at 10:45 +0200, Ulrich Windl wrote:
You are still mixing total migration time (which may be minutes) with virtual
stand-still time (which is a few seconds).
Correct. It was not clear (to me) that when the time to migrate was
several minutes, the actual service outage was
On Thu, 2013-05-23 at 15:00 -0400, David Vossel wrote:
Migration time, depending on network speed and hardware, is much longer than
the shared storage option (minutes vs. seconds).
This is just one data point (of course), but for the vast majority of
services that I run, if the live
On Wed, 2013-04-24 at 08:48 +0200, Ulrich Windl wrote:
Greg Woods wo...@ucar.edu wrote on 23.04.2013 at 21:20 in message
Apr 19 17:02:22 vmn2 kernel: block drbd0: Terminating asender thread
Apr 19 17:02:22 vmn2 kernel: block drbd0: Connection closed
Apr 19 17:02:22 vmn2 kernel: block
On Wed, 2013-04-24 at 12:11 +0200, Lars Ellenberg wrote:
drbd[25887]: 2013/04/19_17:02:07 DEBUG: vmgroup2:
Calling /usr/sbin/crm_master -Q -l reboot -v 1
I apologize for the noise about this. Further checks of the logs on all
my clusters show that this is normal behavior. I started
On Mon, 2013-04-22 at 09:50 -0600, Greg Woods wrote:
On Mon, 2013-04-22 at 10:12 +1000, Andrew Beekhof wrote:
On Saturday, April 20, 2013, Greg Woods wrote:
Often one of the
nodes gets stuck at Stopping HA Services
That means pacemaker is waiting for one of your resources to stop
Here's a new issue. We have had two outages, about 3 weeks apart, on one
of our Heartbeat/Pacemaker/DRBD two-node clusters. In both cases, this
was logged:
Apr 19 17:02:22 vmn2 kernel: block drbd0: PingAck did not arrive in
time.
Apr 19 17:02:22 vmn2 kernel: block drbd0: peer( Primary -> Unknown )
On Fri, 2013-04-19 at 16:43 +0200, Florian Crouzat wrote:
crm configure property
OK, thanks for the suggestions. What is the difference between
maintenance-mode=true and stop-all-resources=true? I tried the
latter first, and all the resources do stop, except that all the stonith
resources are
I realize that nobody can solve a problem based on a single log entry,
but I am trying to understand what happened with a cluster problem
today. A similar thing happened with this cluster about 3 weeks ago, so
this is one of those hard-to-solve intermittent issues. But it might
help me now if I
On Sun, 2013-03-24 at 01:36 -0700, tubaguy50035 wrote:
params ipv6addr=2600:3c00::0034:c007 nic=eth0:3 \
Are you sure that's a valid IPV6 address? I get headaches every time I
look at these, but it seems a valid address is 8 groups, and you've got
5 there. Maybe you mean
On Thu, 2013-01-31 at 02:09 +, Robinson, Eric wrote:
the secondary should wait for a manual command to become primary.
That can be accomplished with the meatware STONITH device. Requires a
command to be run to tell the wannabe primary that the secondary is
really dead (and, of course, you
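As a sketch (hostnames are placeholders; this assumes the cluster-glue
meatware plugin and its meatclient helper), the stonith resource simply
waits until an operator confirms the node is really dead:
  primitive st-meat stonith:meatware \
      params hostlist="nodeA nodeB"
The operator then confirms the kill by hand on the surviving node:
  # meatclient -c nodeA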
On Thu, 2013-01-10 at 08:35 +1100, Andrew Beekhof wrote:
On Wed, Jan 9, 2013 at 4:16 PM, Greg Woods wo...@ucar.edu wrote:
I got the cluster running with xend by
moving the heartbeat to a different interface.
Having heartbeat start after the bridge is created _should_ also work
On Tue, 2013-01-08 at 09:18 +1100, Andrew Beekhof wrote:
On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote:
The problem is that either node can come up and run all the resources,
but as soon as I bring the other node online, it briefly looks normal,
but as soon as the stonith resource
On Wed, 2013-01-09 at 13:15 +1100, Andrew Beekhof wrote:
IIRC, part of the activation involves tearing down the normal
interface and creating the bridge.
At this point the device heartbeat was talking to is gone.
I hadn't thought of that, because afterwards, ethX looks exactly the
same as it
On Fri, 2012-12-28 at 14:54 -0700, Greg Woods wrote:
The problem is that either node can come up and run all the resources,
but as soon as I bring the other node online, it briefly looks normal,
but as soon as the stonith resource starts, the currently running node
gets fenced and the new
On Tue, 2013-01-01 at 14:58 +0330, Ali Masoudi wrote:
Is it mandatory to use same ha.cf on both nodes?
I don't think it is absolutely mandatory, but it is best practice.
Unless you really know what you are doing, you can run into difficulties
getting heartbeat to work properly if the ha.cf
On Mon, 2012-12-31 at 15:09 +0330, Ali Masoudi wrote:
ucast eth3 192.168.50.17
If you are using ucast, then you need one line for each node's IP in the
ha.cf file. Either that or different ha.cf files on each node. What is
needed is the IP of the other node, but heartbeat is smart enough to
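In other words, an identical ha.cf can carry one ucast line per node,
because heartbeat skips the line that names its own address. A sketch for
a two-node setup (the second address is an assumption):
  ucast eth3 192.168.50.17
  ucast eth3 192.168.50.18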
I did some reconfiguration of the NICs and IP addresses on my 2-node
test cluster (running heartbeat and Pacemaker on CentOS 5, slightly old
versions but they have been working fine up to now on this and several
other clusters). I am sure that the NIC configuration is correct and
that the CIB has
On Thu, 2011-12-01 at 13:25 -0400, Chris Bowlby wrote:
Hi Everyone,
I'm in the process of configuring a 2 node + DRBD enabled DHCP cluster
This doesn't really address your specific question, but I got dhcpd to
work by using the ocf:heartbeat:anything RA.
primitive dhcp
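A sketch of what such a primitive can look like with the
ocf:heartbeat:anything RA; the binary path, the -f (run in the foreground)
option, and the pidfile location are assumptions to adapt:
  primitive dhcp ocf:heartbeat:anything \
      params binfile="/usr/sbin/dhcpd" cmdline_options="-f" \
          pidfile="/var/run/dhcpd-ha.pid" \
      op monitor interval="30s" timeout="20s"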
On Mon, 2011-06-20 at 17:47 +0800, Emmanuel Noobadmin wrote:
The objective is to achieve sub minute monitoring of services like
httpd and exim/dovecot so that I can run a script to notify/SMS myself
when one of the machines fails to respond. Right now I'm just running
a cron script every few
On Mon, 2011-05-23 at 13:59 -0700, Hai Tao wrote:
this might not be too close to HA, but I am not sure if someone has seen this
before:
I use a serial cable between two nodes, and I am testing the heartbeat with:
server2$ cat /dev/ttyS0
server1$ echo hello > /dev/ttyS0
instead
On Mon, 2011-04-04 at 11:44 -0500, Neil Aggarwal wrote:
From what I can figure out from the ha.cf file, heartbeat
uses ping to tell if the peer is up.
Not really. It uses special heartbeat packets to tell if the peer is up.
Ping is used to tell the difference between a dead peer and a bad NIC
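The ping targets for that test are declared in ha.cf; a sketch, assuming
a router address that is expected to stay up (do not point this at the
other cluster node):
  ping 192.168.9.1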
On Mon, 2011-04-04 at 13:38 -0500, Neil Aggarwal wrote:
crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip=192.168.9.101 cidr_netmask=32 \
op monitor interval=30s
Does that mean heartbeat is being used to detect
when to move the IP address to the standby server?
On Wed, 2010-12-29 at 12:56 +0100, Dejan Muhamedagic wrote:
Dec 28 09:19:18 vmserve.scd.ucar.edu crmd: [7518]: info: do_lrm_rsc_op:
Performing key=21:2:0:fb701221-ba59-4de8-88dc-032cab9ec090
op=vmgroup1:0_stop_0 )
Dec 28 09:19:18 vmserve.scd.ucar.edu lrmd: [7514]: info:
On Tue, Dec 28, 2010 at 03:18:06PM -0700, Greg Woods wrote:
I updated one of my clusters today, and among other things, I updated
from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related
or not.
Turns out it is. I downgraded the idle node to 1.0.9 and started
heartbeat
I updated one of my clusters today, and among other things, I updated
from pacemaker-1.0.9 to 1.0.10. I don't know if that is directly related
or not.
The problem is that I cannot get the cluster to come up clean. Right now
all resources are running on one node and it is OK that way. As soon as
I
On Tue, 2010-12-21 at 12:09 +0100, Dejan Muhamedagic wrote:
Could it be that the status
shown below is part of a node status which is not in the cluster
any more? Or a node which is down?
No, that is not possible. This is a two-node cluster and both nodes have
been up for many days and are
On Mon, 2010-12-20 at 12:40 +0100, Dejan Muhamedagic wrote:
That's strange. resource cleanup should definitely remove the
LRM (status) part. Can you please try again and then do:
# cibadmin -Q | grep VM-paranfsvm
It seems like it is not removing status info for old removed resources:
On Thu, 2010-11-18 at 14:46 +0100, Sébastien Prud'homme wrote:
I'm using meatware as a second
stonith resource
I'm doing this and it works fine.
Unfortunately, after several tests, I didn't find a way to make it
work: only the first stonith resource is used (and fails), the
cluster enter
On Wed, 2010-11-03 at 11:13 +0100, Dejan Muhamedagic wrote:
ERROR with rpm_check_debug vs depsolve:
heartbeat-ldirectord conflicts with ldirectord-1.0.3-2.6.el5.x86_64
Complete!
(1, [u'Please report this error in
https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise
On Tue, 2010-11-02 at 11:11 +0100, Dejan Muhamedagic wrote:
If you're using resource-agents, the package should be named
ldirectord not heartbeat-ldirectord. The two packages should also
have the same release numbers, probably something like 1.0.3-x.
I figured as much. But there appears to be
On Tue, 2010-11-02 at 22:24 +0100, Lars Ellenberg wrote:
The ldirectord package PROVIDES heartbeat-ldirectord and
CONFLICTS with heartbeat-ldirectord.
The ldirectord package's spec has a self-conflict.
This is a patch for the problem.
--- resource-agents.spec
+++ resource-agents.spec
On Thu, 2010-10-28 at 18:38 -0600, Eric Schoeller wrote:
Just a shot in the dark here kind of ... but I know that when I had this
type of problem with a stonith device it was timeout related. You could
try boosting your timeouts all around, or even check what
# time /usr/sbin/ldirectord
On Fri, 2010-10-29 at 12:09 +0900, Masashi Yamaguchi wrote:
I think ldirectord rpm package's spec for RedHat/CentOS is inconsistent.
$ rpm -qp --provides ldirectord-1.0.3-2.el5.x86_64.rpm
config(ldirectord) = 1.0.3-2.el5
heartbeat-ldirectord
ldirectord = 1.0.3-2.el5
$ rpm -qp
I currently have an old heartbeat v1 cluster that I am moving to a newer
Pacemaker/heartbeat v3 cluster. That is, I am moving the functionality
of the old cluster to the new one so that the old one can be phased out.
The new cluster is running all the latest stuff from the clusterlabs
repo under
The same thing happens if I disable the extras repo, and even if I do
yum clean all first. If instead I try to install heartbeat-ldirectord
and disable the clusterlabs repo (which might result in a package that
doesn't work right in any event), I get a different error:
Transaction
On Thu, 2010-10-28 at 14:52 -0600, Greg Woods wrote:
I am a little confused.
I was actually more confused than I thought. When I got this error:
Failed actions:
ldirectord_monitor_0 (node=vmx2.ucar.edu, call=137, rc=5,
status=complete): not installed
ldirectord_monitor_0 (node
This is a continuation of trying to get ldirectord working under
pacemaker. I have a working installation of ldirectord. I know this
because if I manually configure the eth0:0 pseudo-interface with the
virtual server address, and manually start ldirectord with
# /usr/sbin/ldirectord
On Fri, 2010-10-22 at 18:32 +0200, Andrew Beekhof wrote:
if you're just using v1 - that's not a cluster,
that's a prayer.
Then God must answer my prayers, because I have been using some simple
heartbeat v1/DRBD clusters for YEARS, for critical services like DNS.
They have worked flawlessly and
On Wed, 2010-10-20 at 08:13 +0200, Andrew Beekhof wrote:
Um, maybe because heartbeat v1 has a much much much much less steep
learning curve?
I dispute that:
http://theclusterguy.clusterlabs.org/post/178680309/configuring-heartbeat-v1-was-so-simple
This addresses the fact that
On Tue, 2010-10-19 at 10:01 -0600, Serge Dubrouski wrote:
Any particular reason for using Heartbeat v1 instead of CRM/Pacemaker?
Um, maybe because heartbeat v1 has a much much much much less steep
learning curve? If you have a simple two-node cluster where one node is
just a hot spare, it is way
On Mon, 2010-09-27 at 09:43 -0700, Robinson, Eric wrote:
I went so far as to turn off the primary, but the standby still never
took over.
Do you have STONITH configured?
I have run into this too. The primary will not take over unless it is
told somehow that the secondary is really and
On Mon, 2010-09-27 at 12:16 -0700, Robinson, Eric wrote:
Not sure if you noticed in my previous message that I did physically
power down the primary but the standby refused to take any action.
Yes, I did notice that. My point is that I have noted on my clusters
that simply powering it down
On Fri, 2010-09-24 at 11:34 -0600, Greg Woods wrote:
# crm node show
vmserve2.scd.ucar.edu(16fde08d-b4b6-4550-adfb-b3aab83f706f): normal
standby: off
vmserve.scd.ucar.edu(6f5ced83-a790-4519-8449-3d4cf43275b0): normal
standby: off
On the second cluster:
# crm node show
On Thu, 2010-09-09 at 16:35 +0100, Daniel Machado Grilo wrote:
Another way to do this is if you choose LSB instead of OCF category
primitives.
That way you just select the init script from your init.d and that's it.
You do need to ensure that your init script is LSB compliant. This
includes
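The exit codes are the usual sticking point; a quick hand check, with a
placeholder script name (per LSB, status must return 0 while the service
runs and 3 after a clean stop):
  # /etc/init.d/myservice start;  echo "start:  $?"   # expect 0
  # /etc/init.d/myservice status; echo "status: $?"   # expect 0 (running)
  # /etc/init.d/myservice stop;   echo "stop:   $?"   # expect 0
  # /etc/init.d/myservice status; echo "status: $?"   # expect 3 (stopped)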
On Wed, 2010-09-08 at 14:18 -0500, Bradley Leduc wrote:
I'm trying to add NAMED and DHCPD services as resources on a
heartbeat-3.0.1-1.el5 cluster with no luck. I was wondering if anyone would
know of an easy way to do this. Any help would be great.
Are you running pacemaker or just a heartbeat
On Sun, 2010-08-22 at 10:25 -0600, Greg Woods wrote:
The basic problem is that
when I reboot a node in my cluster, it comes back up without its static
routes.
I have determined through experimentation that it is the setup/teardown
of Xen networking that is causing this. The static routes also
OS: CentOS 5.5
heartbeat: heartbeat-3.0.3-2.3.el5 (latest from clusterlabs)
pacemaker: pacemaker-1.0.9.1-1.15.el5 (latest from clusterlabs)
If it matters, this cluster is primarily used to run Xen virtual
machines (xen-3.0.3-105.el5_5.5 kernel-2.6.18-194.11.1.el5xen latest
from CentOS)
I have
On Wed, 2010-08-11 at 17:09 +0200, Alain.Moulle wrote:
crm configure colocation coloc1 +INFINITY:group1 clone-fs1
This says that group1 and clone-fs1 have to be on the same machine. That
prohibits starting clone-fs1 on a machine where group1 is not running.
That isn't what you meant. I
On Wed, 2010-08-11 at 17:52 -0400, Peter Sylvester wrote:
I do have to agree. I've actually been working for almost 4 business days
now on trying to get Heartbeat and Pacemaker working together
It took me six months to build a decent cluster, starting as one who was
very experienced with
On Wed, 2010-08-11 at 17:13 -0500, Dimitri Maziuk wrote:
So is it not practical to run RHEL or CentOS 5.x where you'd get this
version and several more years of distro maintenance?
It's not practical if you want to have both distro maintenance and cluster
support.
I run CentOS 5.5, and
On Thu, 2010-08-12 at 00:38 +0200, Dejan Muhamedagic wrote:
On Wed, Aug 11, 2010 at 09:53:01PM -, Yan Seiner wrote:
Heck, it really should just take two things:
1. IP of remote computer
2. Device to use
Device?
Bang, it just works.
For many of us this would be
On Wed, 2010-08-11 at 20:01 -0500, Dimitri Maziuk wrote:
1) there are installations where throwing in a package from 3rd party repo
will cost you a lot. Like tech. support on a very very expensive piece of
hardware. (Think giant hadron collider type of hardware.)
Sure, there are some
On Fri, 2010-07-23 at 06:24 -0700, Mahadevan Iyer wrote:
When using only heartbeat(no pacemaker) is there a way to do the following
Setup a backup server such that when it tries to take over due to loss of
connectivity with the main server, it waits for confirmation from an operator
This
On Thu, 2010-07-15 at 07:36 -0700, Pushkar Pradhan wrote:
Hi,
I have a strange requirement: I don't want failover to happen unless an
operator says go ahead or a big timeout has occurred (e.g. 1 hour). I am
using Heartbeat R1 style cluster with 2 nodes.
Is this possible or do I need to write
On Mon, 2010-06-28 at 10:47 +0200, Dejan Muhamedagic wrote:
(drbd_xen2:1:probe:stderr) DRBD module version: 8.3.8
userland version: 8.3.6
you should upgrade your drbd tools!
I guess that you should follow this advice.
Just one data point: I get this message in my logs too, but DRBD
On Sun, 2010-06-27 at 03:02 -0700, Joe Shang wrote:
Failed actions:
drbd_xen2:1_start_0 (node=xen1.box.com, call=10, rc=5,
status=complete): not installed
This is one of the things that I don't like about heartbeat/pacemaker. A
minor error (misconfiguring a single resource) can cause
On Sun, 2010-06-27 at 07:57 -0700, Joe Shang wrote:
Jun 27 10:51:49 xen1 lrmd: [3949]: info: RA output:
(drbd_xen2:1:probe:stderr) 'xen2' not defined in your config.
This looks like an error in your DRBD configuration. What is in
drbd.conf? What does drbd-overview or drbdadm state all show?
You could try making one of them primary:
# drbdadm primary xen1
If that doesn't work, you may have encountered a split brain situation.
In that case, you have to tell DRBD that it is OK for one of the
machines to discard the data it has so that the other one can become
primary. Look here:
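For DRBD 8.x the manual recovery is roughly this, assuming resource name
r0. On the node whose data is to be thrown away:
  victim# drbdadm secondary r0
  victim# drbdadm -- --discard-my-data connect r0
And on the node whose data survives:
  survivor# drbdadm connect r0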
On Thu, 2010-05-20 at 18:30 +0100, Alexander Fisher wrote:
I think I'll use IPMI and rackpdu in the same configuration.
That is exactly what I will eventually try (assuming I ever get any time
to work on my test cluster some more).
It is clear that, no matter what I do, I cannot prepare for
Do you know that on APC PDUs, you can group outlets across several
physical PDUs? I've got a bit more testing to do, but this seems to work ok.
The plugin is configured to talk to just one outlet on one of the PDUs and the
PDU does the rest.
No, I didn't know you could do this. I will have
On Wed, 2010-05-05 at 13:29 +0200, Dejan Muhamedagic wrote:
If these servers have a lights-out device and the power
distribution is fairly reliable, that could be an alternative for
fencing.
They do have an IPMI device and it does work. I am trying to insulate
against a failure of the NIC or
We have a pair of servers in a cluster plugged into a pair of APC
rack-mounted PDU's of the sort that could be controlled by this stonith
plugin. My problem is that these are dual power supply servers, which
means I would have to shut down two outlets that are on two different
PDUs to
On Wed, 2010-04-14 at 16:24 -0700, Stephen Punak wrote:
Heartbeat appears to start just fine on all nodes, but none of them see each
other.
Any chance there is a firewall blocking the heartbeat packets? You'd
still see them with wireshark, but they would be blocked from getting to
the
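Heartbeat defaults to UDP port 694 (the udpport directive in ha.cf), so a
quick test is to open that port explicitly; a sketch, assuming iptables
and the default port:
  # iptables -I INPUT -p udp --dport 694 -j ACCEPT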
On Thu, 2010-04-08 at 17:46 +0200, Dejan Muhamedagic wrote:
Does this help?
$ crm ra info stonith:meatware
Yes, it does! Thank you!
--Greg
On Wed, 2010-04-07 at 15:39 +0200, Andrew Beekhof wrote:
I increased the timeout
even further (to 120s instead of the minimum recommended 60) and it
seems to be working. Curious though, because when it does work, the logs
show that the entire stop operation, including a live migration,
On Wed, 2010-04-07 at 12:13 +0200, Dejan Muhamedagic wrote:
There's not much magic, you just configure two stonith resources
and assign different priority, then they'll be tried in a
round-robin fashion. For instance:
primitive st-real stonith:ipmilan \
params ... priority=10 # it
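A fuller sketch of that two-device layout, with a meatware device as the
assumed backup (the elided ipmilan parameters stay elided):
  primitive st-real stonith:ipmilan \
      params ... priority=10
  primitive st-backup stonith:meatware \
      params hostlist="nodeA nodeB" priority=20
The differing priority values decide which device stonithd tries first.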
On Tue, 2010-04-06 at 12:29 +0200, Dejan Muhamedagic wrote:
There's a crm shell bug in 1.0.8-2 in the validation process.
Either revert to the earlier pacemaker or apply this patch:
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/422fed9d8776
OK, that's a relief. I chose to apply the
I'm looking for a good way to deal with the total power drop case. I
am using an iDrac 6 as a stonith device on a pair of Dell R710 servers.
I tried the power drop test today by simply unplugging the power on one
of the nodes. What happens in this case is that the attempt by the other
node to
On Tue, 2010-04-06 at 14:58 -0700, Tony Gan wrote:
I think the solution is using UPS or PDU for STONITH device.
That could improve things in some scenarios, but it does not completely
solve the problem. The cluster is still vulnerable to having the entire
power strip for one node unplugged or
On Sat, 2010-04-03 at 22:55 +0200, Dejan Muhamedagic wrote:
That should've probably caused a connection timeout and this
message:
IPMI operation timed out... :(
Was there such a message in the log?
Now that I know to look for it, yes. So far I am having a great deal of
difficulty sifting
On Sat, 2010-04-03 at 22:45 +0200, Dejan Muhamedagic wrote:
I spoke too soon; now I am getting failures when stopping the Xen
resources manually as well. I can't get both nodes online at the same
time unless I disable stonith.
There should be something in the logs. grep for lrmd and the
Since I applied the most recent Pacemaker update last Friday (now
running pacemaker-1.0.8-2.el5.x86_64 on CentOS 5), I can no longer
enter order directives. I am using the exact same syntax that I used
previously, and the syntax matches some existing directives, but the crm
shell won't take it.
On Thu, 2010-04-01 at 15:38 -0600, Greg Woods wrote:
node1# stonith -t ipmilan -F node2-param-file node2
This works both ways; the remote node reboots. So I should be able to
rule out DRAC configuration issues. I have also checked, double-checked,
and triple checked that the parameters
I am having difficulty achieving a clean failover in a Pacemaker 1.0.7
cluster that is mainly there to run Xen virtual machines. I realize that
nobody can tell me exactly what is wrong without seeing an awful lot of
configuration detail; what I am looking for is more like some general
methods I
In the process of trying to fix two other problems, I messed something
up badly. Now when I go into the crm shell to edit the configuration, on
verify I get a message like this for every one of my configured
resources:
WARNING: vm-ip1: default timeout 20s for start is smaller than the
advised 90
On Fri, 2010-04-02 at 13:02 -0600, Greg Woods wrote:
In a nutshell: if I manually stop all the Xen resources first with a
command like crm resource stop vmname), then failover works perfectly,
I spoke too soon; now I am getting failures when stopping the Xen
resources manually as well. I can't
On Fri, 2010-04-02 at 13:53 -0600, Greg Woods wrote:
WARNING: vm-ip1: default timeout 20s for start is smaller than the
advised 90
Found the answer for this one in the gossamer-threads for the
pacemaker list. Should have thought of looking there first.
For those who are struggling
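In short, the warning goes away once the primitive declares explicit
operation timeouts at least as large as the advised values; a sketch with
assumed values (the RA and parameters here are placeholders):
  primitive vm-ip1 ocf:heartbeat:IPaddr2 \
      params ip=... \
      op start interval="0" timeout="90s" \
      op stop interval="0" timeout="100s"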
I'm trying to get stonith to work on a two-node cluster using Dell
iDrac. If I run stonith manually with a command like:
node1# stonith -t ipmilan -F node2-param-file node2
This works both ways; the remote node reboots. So I should be able to
rule out DRAC configuration issues. I have also
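For reference, the -F file carries the plugin parameters as name=value
lines, the same names that later go into the crm primitive; a sketch with
placeholder values (the standard IPMI-over-LAN port 623 is an assumption):
  auth=straight
  hostname=node2.fqdn
  ipaddr=10.0.0.2
  login=stonithuser
  password=secret
  priv=admin
  port=623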
On one node, I can get all services to start (and they work fine), but
whenever failover occurs, there are NFS-related handles left open, thus
inhibiting/hanging the failover. More specifically, the file systems fail
to unmount.
If you are referring to file systems on the server that are
On Mon, 2010-03-08 at 16:56 +0100, Dejan Muhamedagic wrote:
Hi,
On Fri, Mar 05, 2010 at 03:07:45PM -0700, Greg Woods wrote:
Partially solved, anyway.
Glad you got it solved, but why do you say partially?
Because I managed to get it working without ever figuring out exactly
what it was I
I am in the process of climbing the learning curve for Pacemaker. I'm
using RPMs from clusterlabs on CentOS 5:
heartbeat-3.0.2-2.el5
pacemaker-1.0.7-4.el5
It has been a long hard struggle, but I have mostly gotten my two-node
cluster working. But I've hit a wall trying to get stonith to work. I
Partially solved, anyway.
On Fri, 2010-03-05 at 12:52 -0700, Greg Woods wrote:
crm(live)configure# primitive stonith-vm1 stonith:ipmilan params
auth=straight hostname=node1.fqdn ipaddr=** login=*
password=* priv=admin port=23
crm(live)configure# location nosuicide-vm1 stonith-vm1
On Wed, 2008-12-03 at 21:23 +, Todd, Conor (PDX) wrote:
I can't do this using a crontab because one never knows which host
will be running the SVN service (and have the disks mounted for it).
Has anyone else tackled this issue yet?
You may not know in advance whether a given host is
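One common approach is to install the cron job on every node and have it
bail out unless the resource is local. A sketch, where the resource name
is hypothetical:
  #!/bin/sh
  # run from cron on all nodes; act only where the SVN resource is active
  ACTIVE=$(crm_resource -W -r svn-group 2>/dev/null | awk '{print $NF}')
  # compare against this node's name (watch out for FQDN vs short names)
  [ "$ACTIVE" = "$(uname -n)" ] || exit 0
  # ... the real work (dumps, backups, etc.) goes here ...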
On Mon, 2008-11-24 at 18:47 +0530, Divyarajsinh Jadeja wrote:
Hi,
I am new to Heartbeat. How can we configure OpenLDAP with Heartbeat for
high availability of authentication?
I need to have slapd running on both machines because LDAP replication
needs slapd on both nodes.
I
On Wed, 2008-11-19 at 13:50 -0800, Rob Tanner wrote:
The next thing to try is OpenLDAP and MySQL, both of which are critical
services and both of which are far more complex. Is anyone running them
on Linux-HA? Does it work reliably when you switch, etc.? How do you
have it all
On Thu, 2008-10-23 at 10:48 -0600, Landon Cox wrote:
b) apache, postgresql, mysql and some custom services are always
running on both machines to reduce startup times on failover
You might want to carefully consider the tradeoff here. Getting two-way
database replication to work reliably
On Thu, 2008-10-23 at 13:24 -0600, Landon Cox wrote:
Do you really have
an application where you can't even afford a few seconds down time at
failover?
No. Anything sub-60 seconds would be tolerated.
In that case, I really think it will be easier to set up DRBD. That way
you can
I am aware that heartbeat can be done over USB links using USB-Ethernet
interfaces. I specifically do not want to do that, because I am looking
for a heartbeat link that will be independent of the IP stack, but on
machines that do not have native serial ports. So I got a couple of
Keyspan
I've been using heartbeat for years, since the 1.0 days, and I've never
seen anything quite like this before. I'm running
heartbeat-2.1.3-3.el5.centos (RPM from the CentOS standard repository)
on an x86_64 machine running (obviously) CentOS 5. I'm not using the v2
features though, it's a standard