Re: [Linux-HA] Resource Always Tries to Start on the Wrong Node

2013-06-28 Thread Robinson, Eric
Hello Eric, On 2013-06-27 17:35, Robinson, Eric wrote: -Original Message- I don't understand why resources try to start on the wrong node (and of course fail). Pacemaker 1.0.7 ... looking at the Changelog of Pacemaker 1.0 at https://github.com/ClusterLabs/pacemaker-1.0

Re: [Linux-HA] Resource Always Tries to Start on the Wrong Node

2013-06-27 Thread Robinson, Eric
-Original Message- I don't understand why resources try to start on the wrong node (and of course fail). My nodes are ha05 and ha06. ha05 is master/primary and all resources are running on it. If I run... crm resource stop p_MySQL_185 ..the resource stops fine. Then if I

[Linux-HA] Resource Always Tries to Start on the Wrong Node

2013-06-26 Thread Robinson, Eric
I don't understand why resources try to start on the wrong node (and of course fail). My nodes are ha05 and ha06. ha05 is master/primary and all resources are running on it. If I run... crm resource stop p_MySQL_185 ..the resource stops fine. Then if I run... crm resource start p_MySQL_185

[Linux-HA] Best Corosync and Pacemaker Versions for New Cluster

2013-04-25 Thread Robinson, Eric
We are installing corosync and pacemaker on a brand new RHEL 6.3 cluster today. When we installed using yum, here are the versions that pulled down from the repos. pacemaker-libs-1.1.9-1512.el6.x86_64 pacemaker-1.1.9-1512.el6.x86_64 corosync-1.4.3-26.2.x86_64

Re: [Linux-HA] Many Resources Dependent on One Resource Group

2013-03-26 Thread Robinson, Eric
In the simplest terms, we currently have resources: A = drbd B = filesystem C = cluster IP D thru J = mysql instances. Resource group G1 consists of resources B through J, in that order, and is dependent on resource A. This fails over fine, but it has the serious

Re: [Linux-HA] Many Resources Dependent on One Resource Group

2013-03-26 Thread Robinson, Eric
On Wed, Mar 27, 2013 at 6:12 AM, Robinson, Eric eric.robin...@psmnv.com wrote: In the simplest terms, we currently have resources: A = drbd B = filesystem C = cluster IP D thru J = mysql instances. Resource group G1 consists of resources B through J

[Linux-HA] Many Resources Dependent on One Resource Group

2013-03-24 Thread Robinson, Eric
I've asked this question on the list before and never received a good answer, so here goes again. I've also read the Pacemaker documentation, but I just cannot seem to get this. I have a drbd resource, p_drbd0. I have a resource group, g_clust01, which consists of a filesystem (p_fs_clust01)

Re: [Linux-HA] Many Resources Dependent on One Resource Group

2013-03-24 Thread Robinson, Eric
In the simplest terms, we currently have resources: A = drbd B = filesystem C = cluster IP D thru J = mysql instances. Resource group G1 consists of resources B through J, in that order, and is dependent on resource A. This fails over fine, but it has the serious disadvantage that if you stop

Re: [Linux-HA] Many Resources Dependent on One Resource Group

2013-03-24 Thread Robinson, Eric
meta ordered=false Wouldn't that make it so we could not be sure that the filesystem and cluster IP start before the MySQL instances? Or you could take the MySQL instances out of the group and make them each individually dependent on drbd/filesystem with a colocation/order
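The suggestion in this reply (drop the MySQL instances from the group and tie each one to the base resources individually) can be sketched as a small generator of crm shell constraint lines. This is only an illustration: the resource names (g_base, p_mysql_001, ...) and the constraint IDs are hypothetical, and the "colocation ... inf:" / "order ... inf:" forms follow crmsh syntax of that era, so check them against the crmsh version in use.

```python
def per_instance_constraints(base, instances):
    """Build crm configure lines that tie each instance to `base` on its own.

    Each instance gets one colocation (run where `base` runs) and one
    order constraint (start `base` first), so stopping one instance
    does not affect the others.
    """
    lines = []
    for rsc in instances:
        # "colocation <id> inf: A B" places A with B
        lines.append(f"colocation c_{rsc}_with_{base} inf: {rsc} {base}")
        # "order <id> inf: A B" starts A before B
        lines.append(f"order o_{base}_before_{rsc} inf: {base} {rsc}")
    return lines


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for line in per_instance_constraints("g_base", ["p_mysql_001", "p_mysql_002"]):
        print(line)
```

With 200 instances this yields 400 constraints, which is why generating them is more practical than typing them by hand.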

[Linux-HA] Using a Ping Daemon (or Something Better) to Prevent Split Brain

2013-01-30 Thread Robinson, Eric
We have this configuration: NodeA is located in DataCenterA. NodeB is located in (geographically separate) DataCenterB. DataCenterA is connected to DataCenterB through 4 redundant gigabit links (two physically separate Corosync rings). Both nodes reach the Internet through (geographically

Re: [Linux-HA] Using a Ping Daemon (or Something Better) to Prevent Split Brain

2013-01-30 Thread Robinson, Eric
We have this configuration: NodeA is located in DataCenterA. NodeB is located in (geographically separate) DataCenterB. DataCenterA is connected to DataCenterB through 4 redundant gigabit links (two physically separate Corosync rings). Both nodes reach the Internet through

[Linux-HA] Failover Behavior in Server-Crash Scenario

2012-12-06 Thread Robinson, Eric
With Pacemaker 1.1.8 and drbd 8.4.2, we are observing that when the primary node is put into standby mode ('crm node standby') the drbd resource on the secondary node refuses to be promoted because it is in a WFConnection state. Is this normal and by design? I don't recall seeing this behavior

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-05 Thread Robinson, Eric
If the promote of DRBD on one node cannot be done, this might be because the demote on the other node cannot be achieved. Do you mount a FS ? If so, force : umount -fl /mountpoint Double check (cat /proc/drbd) that the DRBD resource is really secondary on the demoted node. This is with no

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-05 Thread Robinson, Eric
Okay, I think I have some new information on this problem. First, upgrading to drbd 8.4.2 did not help. I believe the problem is that when I do 'crm node offline' Pacemaker is fully stopping the drbd service. This causes drbd on the secondary to go into a WFConnection state. It refuses to

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-05 Thread Robinson, Eric
will not failover On 12/05/2012 12:05 PM, Robinson, Eric wrote: I believe the problem is that when I do 'crm node offline' Pacemaker is fully stopping the drbd service. This causes drbd on the secondary to go into a WFConnection state. It refuses to promote to primary in that state. Probably

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-05 Thread Robinson, Eric
I was thinking drbd losing packets and thus falling back to WFC rather than pacemaker ordering a full stop. Gotcha. Well, I think it is demonstrably the case that it is losing packets because the service is stopped. you could probably find the stop action in the RA and replace it with

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-05 Thread Robinson, Eric
you could probably find the stop action in the RA and replace it with (e.g.) logger 'AIE ***I did not want this***' and then see what gets logged. -- Well, that worked, in the sense that the resource now fails over. I replaced the start and stop actions in the RA with logger

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-04 Thread Robinson, Eric
will not failover 02.12.2012 00:34, Robinson, Eric wrote: Try to set 'target-role=Started' in both of them. Okay, but how does that address the problem of error code 11 from drbdadm? Well, you have error promoting resources. 11 is EAGAIN, usually meaning you did not demote the other
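Taking the reply's reading at face value (that the 11 returned by drbdadm corresponds to the errno value EAGAIN), the mapping from a number to its symbolic name can be checked with a few lines of Python. Whether drbdadm exit codes are really errno values is an assumption carried over from the reply, and errno numbering is platform-specific: 11 is EAGAIN on Linux.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for an errno value on the current platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # On Linux this reports the EAGAIN entry for code 11.
    print(describe_errno(11))
```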

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-04 Thread Robinson, Eric
I am not sure if that will really help you - but in my cluster (ok older pacemaker version) I have the following to define a master slave resource: primitive rsc_sap_HA0_ASCS00

Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-01 Thread Robinson, Eric
Try to set 'target-role=Started' in both of them. Okay, but how does that address the problem of error code 11 from drbdadm? --Eric

[Linux-HA] master/slave drbd resource STILL will not failover

2012-11-29 Thread Robinson, Eric
Bump... does anyone have some insight on this? Google is not turning up anything useful. Our newest cluster will not failover master/slave drbd resources. It works fine manually using drbdadm from a shell prompt, but when we try it using 'crm node standby' and letting the cluster manage the

[Linux-HA] master/slave drbd resource STILL will not failover

2012-11-28 Thread Robinson, Eric
I posted about this a couple of weeks ago but didn't get a response. Our newest cluster will not failover master/slave drbd resources. It works fine manually using drbdadm from a shell prompt, but when we try it using 'crm node standby' and letting the cluster manage the resource, crm_mon just

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-16 Thread Robinson, Eric
Well I happen to be working on a new landing page which partially addresses these points. I'll see if I can get most of them covered. That would be awesome. The overall feel that I have is that of making my way down a country road in the rain. I often feel as though I have missed an

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-15 Thread Robinson, Eric
clusterlabs.org/doc is as good as i can do for docs. i try to keep it up-to-date and version specific (so that documenting corosync 2.x doesn't obliterate the cman/plugin stuff). packages are mostly in the hands of the distros though. building the entire stack (and keeping it up-to-date)

Re: [Linux-HA] pcs or crmsh?

2012-11-14 Thread Robinson, Eric
Should I be using pcs or crmsh? Neither one seems to work quite right. What doesn't work? I think that at this point of time, it'd be easier to get crmsh going/fixed with pcmk 1.1.8. It's probably just some path somewhere. If really nothing works, you *must* use LCMC, Pacemaker GUI.

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Robinson, Eric
That's what makes open-source's ecosystem so vibrant. :) I suspect that you are saying that with tongue somewhat in cheek. :-) Speaking as someone who just wants to get a job done, I do find the rockiness of the terrain discouraging. Just when I was getting used to crmsh, I hear that

Re: [Linux-HA] Antw: Re: cib_replace failed?

2012-11-14 Thread Robinson, Eric
It is actually worse than that: for as long as I remember RH has included a trap for young players where if you edit /etc/hosts all sorts of interesting things may happen after next reboot. Or rpm update. Depending on your choice of editor and phase of the moon. None of my RH6 machines

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Robinson, Eric
I totally agree. I try to use HA setups in production environments but I only do 2 or so a year and meanwhile I have a complete zoo of versions, tools, shells etc. I was trying fairly hard not to say something like that for fear of alienating helpful members of the list, but I have to

Re: [Linux-HA] Antw: Re: pcs or crmsh?

2012-11-14 Thread Robinson, Eric
I can think of 3 tooling changes: - ptest/crm_simulate - hb_report/crm_report - standalone crmsh That's not /too/ bad in 4 years. Fair enough. From my perspective, I was thinking of the fact that our first cluster was rolled out in 2006, back when the documentation was all about paul

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Robinson, Eric
bump. Could someone please review the logs in the links below and tell me what the heck is going on with this cluster? I've never encountered anything like this before. Basically, corosync thinks the cluster is healthy but Pacemaker won't elect a DC! -- Hi Andrew, would love to see the

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Robinson, Eric
Lars, Did you not see my other mail? Regards, Lars Gosh, no, I don't see another one from you about this in the list. I can't imagine what might have happened to it. Can you resend it? --Eric

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Robinson, Eric
Hi Lars, found your email... Um, are you setting a nodeid in corosync.conf? Because I see this: Nov 09 09:07:25 [2609] ha09a.mycharts.md crmd: crit: crm_get_peer: Node ha09a.mycharts.md and ha09a share the same cluster node id '973777088'! This is not

Re: [Linux-HA] cib_replace failed?

2012-11-13 Thread Robinson, Eric
Lars, I'd probably strip everything except the short names out of /etc/HOSTNAME and /etc/hosts, though it may be sufficient to make sure the short names come first. I think something changed with regard to hostname handling in 1.1.8. It looks like you were right about 1.1.8 handling

[Linux-HA] pcs or crmsh?

2012-11-13 Thread Robinson, Eric
Should I be using pcs or crmsh? Neither one seems to work quite right. Here are the packages I have installed on my RHEL 6.3 servers... [root@ha09a ~]# rpm -qa|egrep "pacem|coro|crmsh|pcs"|sort corosync-1.4.1-7.el6_3.1.x86_64 corosynclib-1.4.1-7.el6_3.1.x86_64 crmsh-1.2.1-45.2.x86_64

Re: [Linux-HA] pcs or crmsh?

2012-11-13 Thread Robinson, Eric
The official management tool is/will be pcs. That said, crm has been around for a while, so it might be more complete/stable. I know that, personally, I will be learning pcs. digimer I tried using pcs but I ran into a roadblock right away. The Clusters from Scratch document refers to

Re: [Linux-HA] cib_replace failed?

2012-11-12 Thread Robinson, Eric
Hi Andrew, would love to see the logs from ha09b Below are links to a clean set of logs from nodes ha09a and ha09b. The procedure I followed to collect the logs was: 1. Ensure pacemakerd and corosync are stopped on both nodes. 2. Remove corosync.log on both nodes. 3. Start corosync on ha09a.

Re: [Linux-HA] cib_replace failed?

2012-11-11 Thread Robinson, Eric
Andrew, Um, are you setting a nodeid in corosync.conf? Because I see this: Nov 09 09:07:25 [2609] ha09a.mycharts.md crmd: crit: crm_get_peer: Node ha09a.mycharts.md and ha09a share the same cluster node id '973777088'! Which could easily explain why the cluster

Re: [Linux-HA] cib_replace failed?

2012-11-09 Thread Robinson, Eric
Andrew, I updated to 1.1.8 from the clusterlabs-next repo. Now I am back to the problem where no DC gets elected... Last updated: Thu Nov 8 10:10:06 2012 Last change: Fri Nov 2 17:16:29 2012 Current DC: NONE 0 Nodes configured, unknown expected votes 0 Resources configured.

Re: [Linux-HA] cib_replace failed?

2012-11-08 Thread Robinson, Eric
On Thu, Nov 8, 2012 at 8:31 AM, Robinson, Eric eric.robin...@psmnv.com wrote: Okay, I'll look forward to seeing those. Should I use 1.1.7 until then or just wait? They should be there now. Excellent, I will get them. What is the not ideal part? Well the reason we released 1.1.8

Re: [Linux-HA] cib_replace failed?

2012-11-08 Thread Robinson, Eric
Andrew, On Thu, Nov 8, 2012 at 8:31 AM, Robinson, Eric eric.robin...@psmnv.com wrote: Okay, I'll look forward to seeing those. Should I use 1.1.7 until then or just wait? They should be there now. What is the not ideal part? Well the reason we released 1.1.8 is because we fixed

Re: [Linux-HA] cib_replace failed?

2012-11-07 Thread Robinson, Eric
I tried to build 1.1.8-237 and ran into too many dependency problems. I uninstalled everything and reinstalled using version 1.1.7 rpms, and now I finally get the expected... Last updated: Mon Nov 5 09:08:04 2012 Last change: Mon Nov 5 09:07:55 2012 via crmd on

Re: [Linux-HA] cib_replace failed?

2012-11-05 Thread Robinson, Eric
Basically you're hitting a bug in 1.1.8 I made the is node X alive? check much stricter but there are certain timing windows which produce a false positive. You should be fine after applying the following two patches (or building from HEAD): + Andrew Beekhof (4 weeks ago) b87494d:

Re: [Linux-HA] cib_replace failed?

2012-11-05 Thread Robinson, Eric
Basically you're hitting a bug in 1.1.8 I made the is node X alive? check much stricter but there are certain timing windows which produce a false positive. You should be fine after applying the following two patches (or building from HEAD): + Andrew Beekhof (4 weeks

[Linux-HA] Building from Source

2012-11-05 Thread Robinson, Eric
Is this where I can get latest Pacemaker source, complete with the patches that fix the timing window/false positives issue that Andrew mentioned on the 'cib_replace failed' thread? https://github.com/ClusterLabs/pacemaker/tarball/master -- Eric Robinson

Re: [Linux-HA] cib_replace failed?

2012-11-05 Thread Robinson, Eric
Basically you're hitting a bug in 1.1.8 I made the is node X alive? check much stricter but there are certain timing windows which produce a false positive. You should be fine after applying the following two patches (or building from HEAD): + Andrew Beekhof (4

Re: [Linux-HA] cib_replace failed?

2012-11-02 Thread Robinson, Eric
One has to wonder if the cause of problem is your systems are bogged down by iowait resulting from all that logging and are e.g. dropping packets. -- Dimitri Maziuk rsyslog rate limiting is normal behavior. The messages to the corosync.log are not being dropped and the system if

Re: [Linux-HA] cib_replace failed?

2012-11-02 Thread Robinson, Eric
This is just crazy as heck. The rings are fine and both nodes have joined the cluster, but when I start Pacemaker no DC ever gets elected. [root@ha09b corosync]# corosync-cfgtool -s Printing ring status. Local node ID 990554304 RING ID 0 id = 192.168.10.59 status = ring 0
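A side observation on the corosync-cfgtool output quoted above: the local node ID 990554304 is exactly what you get by reading the ring 0 address 192.168.10.59 as a little-endian 32-bit integer. The sketch below shows only that arithmetic; that corosync derives automatic node IDs this way on every platform and version is an assumption, not something stated in the thread, and the function names are made up for illustration.

```python
import ipaddress


def nodeid_from_ipv4(addr):
    """Interpret the four address bytes as a little-endian 32-bit integer."""
    return int.from_bytes(ipaddress.IPv4Address(addr).packed, "little")


def ipv4_from_nodeid(nodeid):
    """Inverse mapping: recover the dotted-quad address from a node ID."""
    return str(ipaddress.IPv4Address(nodeid.to_bytes(4, "little")))


if __name__ == "__main__":
    print(nodeid_from_ipv4("192.168.10.59"))
    print(ipv4_from_nodeid(990554304))
```

Decoding an ID back to an address this way can help tell which interface a node ID in the logs belongs to.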

Re: [Linux-HA] cib_replace failed?

2012-11-01 Thread Robinson, Eric
That was still the official version in git at the time. See below. Perhaps try the official upstream release of 1.1.8 for RHEL-6? http://www.clusterlabs.org/rpm-next/ Will do. Here is what we have installed... [root@ha09a log]# rpm -qa|egrep pacem|coros

Re: [Linux-HA] cib_replace failed?

2012-11-01 Thread Robinson, Eric
pacemaker-1.1.8-0.901.eedc0cc.git.el6.x86_64 That's an interesting version you have there. Where did you get it from? When I tried to remove it, it said...

Re: [Linux-HA] cib_replace failed?

2012-10-31 Thread Robinson, Eric
Of Robinson, Eric Sent: Tuesday, October 30, 2012 12:25 PM To: General Linux-HA mailing list Subject: Re: [Linux-HA] cib_replace failed? Bringing up a brand new cluster with corosync 1.4.3 and pacemaker 1.1.8. Configuration fails right away. My first configuration command times out

Re: [Linux-HA] cib_replace failed?

2012-10-31 Thread Robinson, Eric
On 10/31/2012 04:59 PM, Robinson, Eric wrote: Nobody has any thoughts on why my 2-node cluster has no DC? As I mentioned, corosync-cfgtool -s shows the ring active with no faults. Do you have /etc/corosync/service.d/pcmk? And does it look exactly as the example given here? http

Re: [Linux-HA] cib_replace failed?

2012-10-31 Thread Robinson, Eric
That probably means that someone (i.e., you ;-) needs to dig more into the logs of corosync pacemaker. There's bound to be a clue there; pacemaker is many things, but definitely not shy when it comes to logging. Regards, Lars Yes, that's what it probably means. :-) For what it

Re: [Linux-HA] cib_replace failed?

2012-10-31 Thread Robinson, Eric
Okay, the two node names are ha09a and ha09b. Starting clean with all services turned off. This is what I get in /var/log/corosync.log on ha09a when I start corosync... Oct 31 10:22:43 corosync [MAIN ] Corosync Cluster Engine ('1.4.3'): started and ready to provide service. Oct 31 10:22:43

[Linux-HA] cib_replace failed?

2012-10-30 Thread Robinson, Eric
Bringing up a brand new cluster with corosync 1.4.3 and pacemaker 1.1.8. Configuration fails right away. My first configuration command times out with the error... [root@ha09a ~]# crm configure property stonith-enabled=false Call cib_replace failed (-62): Timer expired null ERROR: could not

Re: [Linux-HA] cib_replace failed?

2012-10-30 Thread Robinson, Eric
Bringing up a brand new cluster with corosync 1.4.3 and pacemaker 1.1.8. Configuration fails right away. My first configuration command times out with the error... [root@ha09a ~]# crm configure property stonith-enabled=false Call cib_replace failed (-62): Timer expired null

Re: [Linux-HA] Who Stole CRM? Just Kidding. But Seriously.

2012-10-29 Thread Robinson, Eric
On Mon, Oct 29, 2012 at 8:09 PM, Robinson, Eric eric.robin...@psmnv.com wrote: It's been replaced in RHEL by pcs. You can still use CRM (or any other manager), but it's not maintained by the pacemaker devs. Madi I'll use whatever I can get my hands on. Neither crm nor pcs came

Re: [Linux-HA] We Rebooted a Healthy Standby Node and All the Services on the Primary Node Restarted?

2012-05-09 Thread Robinson, Eric
: to be checked (chkconfig etc.) Alain Verified. Services are only started by Pacemaker, nothing starts on boot (except Pacemaker and Corosync). De :Robinson, Eric eric.robin...@psmnv.com A : linux-ha@lists.linux-ha.org Date : 07/05/2012 20:27 Objet : [Linux-HA] We Rebooted

[Linux-HA] We Rebooted a Healthy Standby Node and All the Services on the Primary Node Restarted?

2012-05-07 Thread Robinson, Eric
Hi guys, we rebooted a standby node of a healthy cluster and suddenly all the resources on the primary cluster restarted. What's up with that? Before rebooting the standby node, we did the normal stuff to verify that all was well. crm_mon showed all nodes online, in their expected roles, with

[Linux-HA] Off-site Quorum Provider?

2012-03-02 Thread Robinson, Eric
We have two geographically separate data centers connected by 4 x Gigabit links (in 2 trunks). Our HA clusters are distributed between the data centers, with each node of a 2-node cluster in a separate data center. (In the case of our 3-node clusters, 2 nodes are in one data center and the 3rd

Re: [Linux-HA] Off-site Quorum Provider?

2012-03-02 Thread Robinson, Eric
, 2012 at 10:34 PM, Robinson, Eric eric.robin...@psmnv.com wrote: We have two geographically separate data centers connected by 4 x Gigabit links (in 2 trunks). Our HA clusters are distributed between the data centers, with each node of a 2-node cluster in a separate data center

[Linux-HA] Light Weight Quorum Arbitration

2011-12-03 Thread Robinson, Eric
I have a geographically dispersed (stretch) cluster, where one node is in data center A and the other node is in data center B. I have done everything possible to ensure link redundancy between the cluster nodes. Each node has 4 x gigabit links connected to 4 different sets of switches and routers
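The reason a two-node stretch cluster needs some form of arbitration comes down to majority arithmetic: with two votes, neither side of a split holds a majority, while a third vote at another site lets one side keep quorum. A minimal sketch of that arithmetic follows; it models simple majority voting only, not any particular quorum implementation, and the function names are hypothetical.

```python
def quorum_threshold(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(votes_visible, total_votes):
    """True if the partition seeing `votes_visible` votes holds a majority."""
    return votes_visible >= quorum_threshold(total_votes)


if __name__ == "__main__":
    # Two nodes, link between data centers lost: each side sees 1 of 2 votes.
    print(has_quorum(1, 2))
    # Two nodes plus an off-site arbitrator: a node that still reaches
    # the arbitrator sees 2 of 3 votes.
    print(has_quorum(2, 3))
```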

[Linux-HA] Should This Worry Me?

2011-11-12 Thread Robinson, Eric
Should I be concerned that the standby node of a 2-node cluster is logging these messages about every 15 seconds? This cluster has been up and running apparently fine for a year. Nov 12 17:53:59 ha05 crm_attribute: [2420]: info: Invoked: crm_attribute -N ha05.mycharts.md -n master-p_DRBD:1 -l

Re: [Linux-HA] How to Enabled More Detailed Debugging

2011-11-07 Thread Robinson, Eric
As Florian mentioned, there's the debug option, but I doubt it is going to help. What may help is to take a look at the network traffic, but you'd need really good sight ;-) Thanks, You're right, it didn't help. What helped was going back to the Linux bonding documentation,

Re: [Linux-HA] A Few Things Resolved

2011-11-06 Thread Robinson, Eric
cluster. Whew. -- Eric Robinson Eric, Please file a bug against the scripts then. Thank you. Regards, Tristan I imagine since I heard about it from Florian, who heard about it in an IRC chat, somebody must be way ahead of me on that. --Eric

Re: [Linux-HA] My Shiny New Cluster Works Great Except...

2011-11-06 Thread Robinson, Eric
-11-06 09:13, Robinson, Eric wrote: Two little problems with my new cluster. 1. When I put the primary node in standby, the resources failover to the other node just fine. When I put the primary back online, the resources automatically fail back, but DRBD on the standby node goes

Re: [Linux-HA] My Shiny New Cluster Works Great Except...

2011-11-06 Thread Robinson, Eric
That's quite definitely a misconfiguration. Please create a CIB dump with cibadmin -Q, make that available on an HTTP server somewhere (might as well be pastebin or similar), and share the URL here. Done. www.psmnv.com/downloads/cibadmin.dump 2. When I do drbdadm up, I get the

[Linux-HA] A Few Things Resolved

2011-11-05 Thread Robinson, Eric
A couple of things are fixed. The ring FAULTY messages were caused by genuine network communication failures (go figure) which in turn had two root causes. One was my error and the other was Red Hat's. Although I have set up bonding many times before, on these servers I had BONDING_OPS instead of

[Linux-HA] Are You (or Can You Be) a Corosync Consultant?

2011-11-04 Thread Robinson, Eric
We are unable to find the cause of 'ringid FAULTY adminisrtative intervention required' on our newest cluster. Is there someone in this list who knows corosync really well and who we could hire on a consulting basis? Frankly, we're desperate. --Eric

[Linux-HA] How to Enabled More Detailed Debugging

2011-11-03 Thread Robinson, Eric
We keep getting 'ringid FAULTY adminisrtative intervention required' but there is nothing in the logs that indicates why it reached this decision. Is there a way to enable more detailed debugging so I can see why it is disabling the ring? --Eric

Re: [Linux-HA] ringid FAULTY adminisrtative intervention required

2011-11-03 Thread Robinson, Eric
I have two rings configured. Everything looks fine until I bring up the second node. Then the second ring on both nodes reports: status = Marking seqid 21 ringid 1 interface 198.51.100.55 FAULTY - adminisrtative intervention required. The rings are on different

Re: [Linux-HA] ringid FAULTY adminisrtative intervention required

2011-11-03 Thread Robinson, Eric
Technology Physician Select Management, LLC 775-885-2211 x 111 -Original Message- From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Robinson, Eric Sent: Thursday, November 03, 2011 4:43 AM To: General Linux-HA mailing list Subject: Re

Re: [Linux-HA] ringid FAULTY adminisrtative intervention required

2011-11-03 Thread Robinson, Eric
logging { debug: off } How about changing that to on? Yes, I feel silly. Thanks. --Eric

Re: [Linux-HA] ringid FAULTY adminisrtative intervention required

2011-11-03 Thread Robinson, Eric
Can anyone see from this debug log why the ring is being marked faulty? The FAULTY message is the last line. Nov 03 16:57:01 corosync [MAIN ] Corosync Cluster Engine ('1.2.3'): started and ready to provide service. Nov 03 16:57:01 corosync [MAIN ] Corosync built-in features: nss rdma Nov 03

Re: [Linux-HA] ringid FAULTY adminisrtative intervention required

2011-11-03 Thread Robinson, Eric
Of Robinson, Eric Sent: Thursday, November 03, 2011 5:04 PM To: General Linux-HA mailing list Subject: Re: [Linux-HA] ringid FAULTY adminisrtative intervention required Can anyone see from this debug log why the ring is being marked faulty? The FAULTY message is the last line. Nov 03 16:57:01

[Linux-HA] ringid FAULTY adminisrtative intervention required

2011-11-02 Thread Robinson, Eric
After the euphoria of fixing my problem where selinux breaks ring initialization (thanks all) I thought I would have smooth sailing. No such luck. I have two rings configured. Everything looks fine until I bring up the second node. Then the second ring on both nodes reports: status = Marking

Re: [Linux-HA] Antw: ringid FAULTY adminisrtative intervention required

2011-11-02 Thread Robinson, Eric
We are using redundant rings at two different speeds in SLES11 SP1. Until we upgraded corosync from 1.4 to some 1.4 version, the slower ring was marked faulty every time there was significant configuration exchange on the rings. Now a ring still gets faulty from time to time, but it does

[Linux-HA] Does ANYTHING Work on RHEL6?

2011-10-31 Thread Robinson, Eric
I can't get a cluster up on RHEL6. First I tried pacemaker+corosync, but corosync complains... Could not get the ring status, the error is: 6 ..and I cannot connect to the cluster. So then I tried pacemaker+heartbeat, only to learn that pacemaker no longer supports the heartbeat cluster

Re: [Linux-HA] Does ANYTHING Work on RHEL6?

2011-10-31 Thread Robinson, Eric
Florian's suggestion sounds like a good start for you. After that, try firewalls and selinux. Well, sheesh, it was selinux. Write that one down, folks. Selinux causes error 6 problem when initializing the ring. And this is all because the RHEL 6 installer does not ask whether selinux should

[Linux-HA] Could not get the ring status, the error is: 6

2011-10-30 Thread Robinson, Eric
I just installed and configured corosync-1.2.3-21.el6_0.1.x86_64 on RHEL6. At startup, the corosync log appears to be complete except for the line, A processor joined or left the membership and a new membership was formed. I cannot connect to the cluster, and corosync-cfgtool states: Local node

Re: [Linux-HA] Antwort: Re: Escaping Depenencies in Resource Groups

2011-09-30 Thread Robinson, Eric
list linux-ha@lists.linux-ha.org An 'General Linux-HA mailing list' linux-ha@lists.linux-ha.org Kopie Thema Re: [Linux-HA] Escaping Depenencies in Resource Groups Robinson, Eric wrote on 2011-09-29: We have a 3-node cluster running about 200 instances of MySQL. The way

[Linux-HA] Escaping Depenencies in Resource Groups

2011-09-29 Thread Robinson, Eric
We have a 3-node cluster running about 200 instances of MySQL. The way we have our resource groups set up, the dependency stack looks like this: Cluster_IP Filesystem MySQL_001 MySQL_002 MySQL_003 MySQL_004 ...

Re: [Linux-HA] Always Get a Billion Failed Actions

2011-07-14 Thread Robinson, Eric
On Thu, Jun 16, 2011 at 8:38 PM, Robinson, Eric eric.robin...@psmnv.com wrote: crm_mon on my system displays a lot of failed actions, I guess because the init script for the resource is not fully lsb compliant? In any case, the resources seem to work okay and failover okay. How can I

[Linux-HA] Always Get a Billion Failed Actions

2011-06-16 Thread Robinson, Eric
crm_mon on my system displays a lot of failed actions, I guess because the init script for the resource is not fully lsb compliant? In any case, the resources seem to work okay and failover okay. How can I get rid of all those failed actions? crm_mon output follows... Last

[Linux-HA] Linux-HA Over WAN Advisable?

2011-04-01 Thread Robinson, Eric
Greetings! We have a few Corosync+PaceMaker+DRBD clusters and a couple older Heartbeat+DRBD clusters. Our infrastructure is currently located in a single facility. We have the opportunity to establish a DR site in another data center over a high bandwidth connection (2-4Gbps). I am thinking of

[Linux-HA] Failover Post Mortem - Why did the Healthy Node Give its Resources to the Unhealthy One?

2011-03-03 Thread Robinson, Eric
I woke up this morning and discovered that one of my clusters had failed over during the night. Everything was working fine, but I wanted to know what happened. From reading the logs, it looks like the primary node ftp02 gave up its resources to the secondary node ftp01, which had become

Re: [Linux-HA] How to monitor the nic link status

2010-11-29 Thread Robinson, Eric
-Original Message- From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Mia Lueng Sent: Monday, November 29, 2010 6:24 AM To: linux-ha@lists.linux-ha.org Subject: [Linux-HA] How to monitor the nic link status Hi: I have configured a

Re: [Linux-HA] Why Did the Resource Fail Back when Resource Stickiness was Set?

2010-11-25 Thread Robinson, Eric
FYI -- additional info. 1. I did an 'unmove' for resources g_clust04 and g_clust05, which removed the 'location cli-prefer' statements from the crm config. The resources stayed where they were. 2. I changed resource-stickiness to 200. 3. I performed the power plug pull test on node ha07b again.

[Linux-HA] Why Did the Resource Fail Back when Resource Stickiness was Set?

2010-11-24 Thread Robinson, Eric
I performed a power-plug-pull test on my newest cluster and it failed over as expected. However, when I restored power to the failed node, the resources failed back. I don't understand why this happened since I have resource stickiness set. There are three nodes in the cluster. I pulled the plug

Re: [Linux-HA] Why Did the Resource Fail Back when Resource Stickiness was Set?

2010-11-24 Thread Robinson, Eric
Nope, this was created by move/migrate command. Somebody forgot unmove/unmigrate. Actually, it wasn't. We pulled the plug on node ha07b, watched the resource fail over, and then plugged it back in and watched it fail back. We didn't issue a move command that I recall. -- Eric Robinson

Re: [Linux-HA] Why Did the Resource Fail Back when Resource Stickiness was Set?

2010-11-24 Thread Robinson, Eric
I've gone through something similar, and this could be the case here as well: the combined score for the group is higher than the resource-stickiness value. Setting it to a much higher value (or even inf:) would do the job. I'll give that a try! -- Eric Robinson Disclaimer - November 24,

[Linux-HA] 3-Node DRBD Cluster without Stacking

2010-11-16 Thread Robinson, Eric
I'm not sure if this list or the DRBD list is the right one to ask this. Is it possible to deploy a 3-node CRM-based cluster where: -- nodes A and C share resource R1 on /dev/drbd0 -- nodes B and C share resource R2 on /dev/drbd1 -- resource constraints prevent R1 from

Re: [Linux-HA] 3-Node DRBD Cluster without Stacking

2010-11-16 Thread Robinson, Eric
I'm not sure if this list or the DRBD list is the right one to ask this. Is it possible to deploy a 3-node CRM-based cluster where: -- nodes A and C share resource R1 on /dev/drbd0 -- nodes B and C share resource R2 on /dev/drbd1 -- resource constraints prevent R1

Re: [Linux-HA] Redundant Rings Still Not There?

2010-10-24 Thread Robinson, Eric
That way, if something happens to switched network #1, Corosync can still track node status through switched net #2. Once this configuration is built, I can use Pacemaker with resource constraints to ensure that resource R1 can only run on SERVER_A or SERVER_C (usually A) and

Re: [Linux-HA] Redundant Rings Still Not There?

2010-10-24 Thread Robinson, Eric
But now could someone please elaborate on Dejan Muhamedagic's original comment that started the thread? What does "redundant rings are still not there" mean? Is a three-node cluster an unreliable setup because Corosync and/or Pacemaker are not really ready for that? not Pacemaker -

Re: [Linux-HA] Redundant Rings Still Not There?

2010-10-24 Thread Robinson, Eric
Heartbeat uses the term "communication path" instead. And yes, it has been able to support clusters of more than two nodes since version 2. Take a look at http://oss.linbit.com/drbd-mc/ It's a nifty Java application which can help you to create your initial cluster configuration in no time.

Re: [Linux-HA] Redundant Rings Still Not There?

2010-10-24 Thread Robinson, Eric
Heartbeat is not deprecated; it is still supported by the Linbit folks. (Many thanks to them.) But if you need clvmd or GFS2, you would have to use corosync, for example. It may not be deprecated per se, but there is no getting started guide on the ClusterLabs site for using Pacemaker+Heartbeat

Re: [Linux-HA] Redundant Rings Still Not There?

2010-10-24 Thread Robinson, Eric
But don't let me stop you from using corosync, you still can build your particular cluster with the same amount of hardware. The only thing that would stop me from using corosync is the thought that it is somehow unreliable or not there yet, scary as that sounds. The cluster would be

Re: [Linux-HA] Redundant Rings Still Not There?

2010-10-23 Thread Robinson, Eric
Looks like you are mixing up physical connections and Corosync rings. I should not have mentioned DRBD at all as it confuses the question. Let me try it this way: How do I build a three-node Corosync cluster with redundant heartbeat paths? I don't trust the switched network or the Ethernet

Re: [Linux-HA] Redundant Rings Still Not There?

2010-10-23 Thread Robinson, Eric
3-node cluster is much easier to say than to configure, apparently. :-) It really isn't :) Encouraged by your "it really isn't," I now press forward. :-) Based on what I'm hearing, this is what I think I have learned... It is possible to build a 3-node cluster with redundant heartbeat
