Re: [Linux-HA] HA Summit Key-signing Party (was: Organizing HA Summit 2015)

2015-01-26 Thread Marcus Bointon
I have some keybase.io invitations if anyone wants one.

Marcus
--
Marcus Bointon
Technical Director, Synchromedia Limited

Creators of http://www.smartmessages.net/
UK 1CRM solutions http://www.syniah.com/
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/



signature.asc
Description: Message signed with OpenPGP using GPGMail
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] How many primitives, groups can I have

2013-11-11 Thread Marcus Bointon
On 11 Nov 2013, at 13:57, Michael Brookhuis mimabr...@googlemail.com wrote:

 Is there a limit in the number of primitives, etc you can have?
 What maximum number is recommended based on best-practices?
 
 Are 1500 too many?

I think it depends on your transport layer. If you're using heartbeat, I think I 
ran into a problem where the whole set of resource definitions had to fit into 
one packet, which typically limited you to around 30 resources. I think corosync 
removed that limit. I'm hazy on the details though.

Marcus


Re: [Linux-HA] The Heartbeat of learning

2013-10-16 Thread Marcus Bointon
On 16 Oct 2013, at 16:44, Digimer li...@alteeve.ca wrote:

 Corosync uses the totem protocol for heartbeat like monitoring of the...

Thank you for the clearest summary of the HA stack I've seen so far! It should 
be on the HA site somewhere (it probably is...!)

Marcus



Re: [Linux-HA] Need some help, Two node M/S IPAddr2 IP Failover, both nodes slaves

2013-08-09 Thread Marcus Bointon
On 9 Aug 2013, at 06:29, Gary Mazzaferro ga...@oedata.com wrote:

 Additionally, the traffic to the nodes seem interleaved, random connection
 to node1/node2 from clients.
 
 And, when I shut down node2 or place it in standby, the VIP doesn't shift
 to node1, it appears that node1 is down.

This sounds like you may not have an ARP resource grouped with the IP, so 
switches are still sending traffic to whichever node is in their stale ARP 
cache. This is my usual config for managing a floating IP:

primitive ip2 ocf:heartbeat:IPaddr2 \
    params ip=x.x.x.x cidr_netmask=24 nic=eth0 \
    op monitor interval=10s
primitive ip2arp ocf:heartbeat:SendArp \
    params ip=x.x.x.x nic=eth0
group proxyfloat2 ip2 ip2arp
location cli-standby-proxyfloat2 proxyfloat2 \
    rule $id=cli-standby-rule-proxyfloat2 -inf: #uname eq proxy1 and #uname eq proxy2
colocation ip_with_arp2 inf: ip2 ip2arp
order arp_after_ip2 inf: ip2:start ip2arp:start

Change your IPs and node names as appropriate.

Marcus


Re: [Linux-HA] MailTo resources, 'message too long' errors

2013-08-06 Thread Marcus Bointon
On 6 Aug 2013, at 11:10, Dejan Muhamedagic deja...@fastmail.fm wrote:

 Which compression setting do you have now? I think you should
 try with different compression settings as suggested there by Lars.

I had no compression set at all. I added the settings as advised in that 
posting and it does seem to have solved the problem for now, though clearly I 
need to move to corosync ASAP before my CIB gets big enough to break again!

Any idea on how to improve email notifications? At the moment the only 
notifications I get out of the cluster amount to 'something happened'.

Marcus


[Linux-HA] MailTo resources, 'message too long' errors

2013-08-05 Thread Marcus Bointon
I have two nodes running heartbeat 3.0.5 and pacemaker 1.1.6 (both from the 
linux-ha lucid ppa). They are running 11 groups each comprising an 
ocf:heartbeat:IPaddr2, an ocf:heartbeat:SendArp and an ocf:heartbeat:MailTo.

There is also a mailto resource configured for the overall cluster.

Despite all these, all the notifications I ever receive look identical:

Heartbeat status change: Migrating resource away at Mon Aug  5 13:01:49 UTC 
2013 from proxy2
Command line was:
/usr/lib/ocf/resource.d//heartbeat/MailTo stop

One major omission here is that it doesn't tell me which resource it migrated.

Is there some way of configuring the cluster itself to send notifications so 
that I can remove the individual mailto resources?

Coincidentally (?), I've just started to get this problem:

Aug  5 11:13:50 proxy1 heartbeat: [2958]: ERROR: glib: ucast_write: Unable to 
send HBcomm packet eth0 192.168.1.10:694 len=78903 [-1]: Message too long
Aug  5 11:13:50 proxy1 heartbeat: [2958]: ERROR: write_child: write failure on 
ucast eth0.: Message too long

This (well at least I assume it's this) is resulting in resources disappearing, 
randomly starting and stopping, flip-flopping between nodes, marking nodes as 
offline and more fun things to keep us awake at night.

The only explanation I've found for this is here 
http://comments.gmane.org/gmane.linux.highavailability.pacemaker/10765
The solutions suggested are to alter compression settings (which we were not 
using before), migrate to corosync and/or to make the cib smaller, hence the 
idea of removing the individual mailtos.
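
For context on the error above: "Message too long" is the standard text for the EMSGSIZE error, which a UDP send returns when the payload cannot fit in a single datagram. The following is a rough sketch in Python of the IPv4 arithmetic only. The names are illustrative, and heartbeat may well enforce a smaller limit of its own that this does not model.

```python
# Largest payload a single UDP datagram can carry over IPv4.
IPV4_MAX_PACKET = 65535  # the IPv4 total-length field is 16 bits
IPV4_HEADER_MIN = 20     # IPv4 header without options
UDP_HEADER = 8

MAX_UDP_PAYLOAD = IPV4_MAX_PACKET - IPV4_HEADER_MIN - UDP_HEADER  # 65507


def fits_in_one_datagram(message_len):
    """Return True if a message of this many bytes fits in one UDP datagram."""
    return message_len <= MAX_UDP_PAYLOAD


reported_len = 78903  # the len= value from the log line above
overshoot = reported_len - MAX_UDP_PAYLOAD  # 13396 bytes too many

print(fits_in_one_datagram(reported_len), overshoot)
```

On these numbers the message in the log is about 13 KB over what any single UDP datagram can hold, which is consistent with the advice to compress the payload or shrink the CIB.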

I've run hb_report, but it doesn't report anything useful; it more or less 
doesn't work.

I'd like to migrate to corosync if it's better, but I'm extremely wary of 
touching anything in the cluster.

Marcus


Re: [Linux-HA] Resource moves

2013-04-19 Thread Marcus Bointon
On 19 Apr 2013, at 14:48, Florian Crouzat gen...@floriancrouzat.net wrote:

 Well, you kinda answered this when you mentioned crm_resource -U.
 You should use unmove instead of move. Unmove will remove the 
 location constraint where move will create a new one.

Thanks, that sounds much better - but it has a slightly annoying effect. I 
normally have a location rule like this:

location cli-prefer-ip3 ip3 rule $id=cli-prefer-rule-ip3 inf: #uname eq proxy1

When I issue a move, it gets removed and replaced with:

location cli-standby-ip3 ip3 \
rule $id=cli-standby-rule-ip3 -inf: #uname eq proxy1

When I unmove, it deletes that location rule, but because the preference rule 
has been removed, it doesn't result in the ip moving back because there is no 
longer any preferred node, so no incentive for it to do so. This seems to 
rather defeat the point of unmove.

Is there a move/unmove that doesn't apply location rules, but just tells the 
resource to move?
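
The behaviour described above can be illustrated with a toy placement model in Python. This is only a sketch and not Pacemaker's actual policy engine: the `place` function is hypothetical, and the rule that a resource stays where it is on a tie is an assumption taken from the behaviour reported in this thread.

```python
INFINITY = 1_000_000  # Pacemaker's numeric value for an INFINITY score


def place(current, nodes, constraints):
    """Return the node a resource would run on in this toy model.

    constraints is a list of (node, score) pairs, one per location rule.
    """
    scores = dict.fromkeys(nodes, 0)
    banned = set()
    for node, score in constraints:
        if score <= -INFINITY:
            banned.add(node)  # a -inf rule means "never run here"
        else:
            scores[node] += score
    allowed = [n for n in nodes if n not in banned]
    if not allowed:
        return None  # nowhere left to run
    best = max(scores[n] for n in allowed)
    # Assumption: on a tie, the resource stays on its current node.
    if current in allowed and scores[current] == best:
        return current
    return next(n for n in allowed if scores[n] == best)


nodes = ["proxy1", "proxy2"]
prefer_proxy1 = [("proxy1", INFINITY)]  # like cli-prefer-ip3
ban_proxy1 = [("proxy1", -INFINITY)]    # like cli-standby-ip3

before = place("proxy1", nodes, prefer_proxy1)
after_move = place(before, nodes, ban_proxy1)  # move replaced the prefer rule
after_unmove = place(after_move, nodes, [])    # unmove deleted the ban rule

print(before, after_move, after_unmove)
```

In this model the resource ends up on proxy2 after the unmove, because with no rules left both nodes score the same and nothing pulls it back to proxy1.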

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info@hand CRM solutions
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/



Re: [Linux-HA] Resource moves

2013-04-19 Thread Marcus Bointon
On 19 Apr 2013, at 15:37, Florian Crouzat gen...@floriancrouzat.net wrote:

 Maybe you should start using different CIBs, each one of them containing 
 a certain set of location constraints.

I wouldn't know where to start with that - but is there really no way to tell a 
resource to go to a particular node without creating persistent location rules?

Marcus


Re: [Linux-HA] Resource moves

2013-04-19 Thread Marcus Bointon

On 19 Apr 2013, at 14:48, Florian Crouzat gen...@floriancrouzat.net wrote:

 About move and constraints, it all goes down to a design choice, but to 
 me, it makes sense (for the reasons I mentioned in my first answer) and 
 it's documented, so ... :)

 When you, as an admin, say otherwise, the cluster trusts you and creates a 
 location constraint representing the administrative decision you just took.

While I am obviously the ultimate decider of what goes where, this mechanism 
doesn't allow me to separate these intentions:

* Move resource x to node 1 now
* Move resource x to node 1 now and never allow it to come back

As far as I can see only the latter is possible, if I'm to believe the "This 
will be the case even if node 1 is the last node in the cluster" warning.

It's obviously possible to have a resource sitting on a node with no 
applicable location rules, and yet it stays put. Since I can create location 
rules that may result in implicit moves, and I can issue move commands too, it 
doesn't seem necessary that the two should be tied together.

I think the most practical solution is to always follow a move with an unmove - 
though it's pretty counter-intuitive and clumsy, kind of like trying to drive a 
car by issuing written instructions...
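
The "move, then unmove" workaround could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, it relies on the `crm resource move` and `crm resource unmove` subcommands of the crm shell, and it defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def move_then_unmove(resource, node=None):
    """Build the two crm shell command lines for a one-off move."""
    move = ["crm", "resource", "move", resource]
    if node:
        move.append(node)
    unmove = ["crm", "resource", "unmove", resource]
    return [move, unmove]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


commands = move_then_unmove("ip3", "proxy2")
run(commands)  # dry run: prints the command lines without touching the cluster
```

In real use you would presumably want to wait until the resource has actually started on the target node before issuing the unmove, which this sketch does not attempt.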

Marcus


Re: [Linux-HA] Antw: Resource moves

2013-04-19 Thread Marcus Bointon
On 19 Apr 2013, at 16:41, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de 
wrote:

 Let me comment: crm resource migrate prm_yours PT2M will make a constraint 
 that will stay in your CIB forever also, but it's active only for 2 minutes.

Where does the 2 minutes come from? As far as I can see they stick around until 
you delete them?

Marcus


[Linux-HA] Resource move not moving

2013-04-16 Thread Marcus Bointon
I'm running crm using heartbeat 3.0.5 and pacemaker 1.1.6 on Ubuntu Lucid 64.

I have a small resource group containing an IP, ARP and email notifier on a 
cluster containing two nodes called proxy1 and proxy2. I asked it to move 
nodes, and it seems to say that was ok, but it hasn't actually moved, and 
crm_mon still shows it on the original node.

# crm resource move proxyfloat3
WARNING: Creating rsc_location constraint 'cli-standby-proxyfloat3' with a 
score of -INFINITY for resource proxyfloat3 on proxy1.
This will prevent proxyfloat3 from running on proxy1 until the 
constraint is removed using the 'crm_resource -U' command or manually with 
cibadmin
This will be the case even if proxy1 is the last node in the cluster
This message can be disabled with -Q

This was in syslog:

Apr 16 13:32:35 proxy1 cib: [2948]: info: cib_process_request: Operation 
complete: op cib_delete for section constraints (origin=local/crm_resource/3, 
version=0.57.2): ok (rc=0)
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: - cib admin_epoch=0 
epoch=57 num_updates=2 /
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + cib 
validate-with=pacemaker-1.0 crm_feature_set=3.0.5 have-quorum=1 
admin_epoch=0 epoch=58 num_updates=1 cib-last-written=Tue Apr 16 
08:52:01 2013 dc-uuid=68890308-615b-4b28-bb8b-5aa00bdbf65c 
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: +   configuration 
Apr 16 13:32:35 proxy1 crmd: [2952]: info: abort_transition_graph: 
te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, 
id=(null), magic=NA, cib=0.58.1) : Non-status change
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + constraints 
Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_state_transition: State 
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: +   rsc_location 
id=cli-standby-proxyfloat3 rsc=proxyfloat3 
Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_state_transition: All 2 cluster 
nodes are eligible to run resources.
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + rule 
id=cli-standby-rule-proxyfloat3 score=-INFINITY boolean-op=and 
Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_pe_invoke: Query 150: Requesting 
the current CIB: S_POLICY_ENGINE
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: +   expression 
id=cli-standby-expr-proxyfloat3 attribute=#uname operation=eq 
value=proxy1 type=string __crm_diff_marker__=added:top /
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /rule
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: +   /rsc_location
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /constraints
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: +   /configuration
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /cib
Apr 16 13:32:35 proxy1 cib: [2948]: info: cib_process_request: Operation 
complete: op cib_modify for section constraints (origin=local/crm_resource/4, 
version=0.58.1): ok (rc=0)

Yet crm status still shows:

 Resource Group: proxyfloat3
     ip3        (ocf::heartbeat:IPaddr2):   Started proxy1
     ip3arp     (ocf::heartbeat:SendArp):   Started proxy1
     ip3email   (ocf::heartbeat:MailTo):    Started proxy1

So if all that's true, why is that resource group still on the original node? 
Is there something else I need to do?

Marcus


Re: [Linux-HA] Resource move not moving

2013-04-16 Thread Marcus Bointon
 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_state_transition: Starting 
PEngine Recheck Timer
Apr 16 13:32:35 proxy1 pengine: [28796]: notice: process_pe_message: Transition 
25: PEngine Input stored in: /var/lib/pengine/pe-input-127.bz2
Apr 16 13:33:17 proxy1 cib: [2948]: info: cib_stats: Processed 5 operations 
(6000.00us average, 0% utilization) in the last 10min

There's a lot there, but nothing that clearly says "I moved proxyfloat3" or, 
more to the point, that it didn't...

Marcus




[Linux-HA] PSU tip

2013-01-03 Thread Marcus Bointon
This is a little something I ran into a while ago, and it occurred to me it 
might be of interest to anyone doing fencing or other power control operations.

Many servers have a BIOS option for what to do after a power failure. The 
choices are usually 'turn on', 'stay off', or 'auto', which returns the server 
to whichever state it was in before. I set mine to auto, which led to a 
problematic situation!

The server was not needed for a while so I did a normal power-off shutdown 
procedure, then turned off the server's PSU via a remote controlled PSU.

Later I came to turn it back on and found I was stuck! Because it was in auto 
mode and had been shut down correctly, it was stuck turned off - turning on the 
PSU had no effect, so I had to call out an engineer to go and press the power 
button. The moral of the story is to leave the BIOS set to 'turn on' and then 
turn it off at the PSU; do not use the 'auto' mode!

This particular server had no IPMI/ILO facility so I couldn't tell it to turn 
on that way.
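
The three BIOS settings can be summarised as a small decision function. This is an illustrative sketch in Python; the policy labels "on", "off" and "auto" are shorthand for the options described above, not the names any particular BIOS uses.

```python
def powers_on_when_mains_returns(bios_policy, was_running_at_power_loss):
    """Return True if the server boots by itself once power is restored."""
    if bios_policy == "on":
        return True
    if bios_policy == "off":
        return False
    if bios_policy == "auto":
        # 'auto' restores the previous state, so a cleanly shut down
        # server stays off when the supply comes back.
        return was_running_at_power_loss
    raise ValueError(f"unknown policy: {bios_policy!r}")


# The situation in the story: clean shutdown first, then power cut at the supply.
stuck = not powers_on_when_mains_returns("auto", was_running_at_power_loss=False)
print(stuck)
```

Only the "on" setting guarantees that restoring power is enough to boot the machine, whatever state it was in when power was cut.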

Marcus


Re: [Linux-HA] Ubuntu precise repo?

2012-12-10 Thread Marcus Bointon
On 10 Dec 2012, at 02:25, Andrew Beekhof and...@beekhof.net wrote:

 It's part of the distro, no need to add anything:
 
   http://clusterlabs.org/quickstart-ubuntu.html

Excellent! Thanks,

Marcus


[Linux-HA] Ubuntu precise repo?

2012-12-07 Thread Marcus Bointon
Is there an HA repo for Ubuntu Precise? 
https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa doesn't go that far.

Marcus


Re: [Linux-HA] Leave Apache running on both active and passive nodes?

2012-08-21 Thread Marcus Bointon
On 21 Aug 2012, at 22:06, David Lang david_l...@intuit.com wrote:
 
 One legitimate reason for doing this is that you can then have heartbeat 
 'monitor' the webserver and if the webserver dies, initiate a failover. 
 However, I think this is better done by having a dummy service that takes no 
 time to start/stop and implements its status with a file, and then have some 
 other, more extensive monitoring system checking your web front end (checking 
 that it actually works, not just that apache is running) and altering the 
 status file that heartbeat checks.
 
 Or you can have your monitoring software send a message to heartbeat to 
 trigger a failover.

Well haproxy does all that out of the box, no tricks or tweakery required. For 
monitoring services within a single server, I'm finding monit works well. If a 
web server fails, haproxy will see that (from outside) and stop sending 
it traffic, and monit (on the server) will give it a kick and send appropriate 
notifications. That setup has coped with most of the problems that have come my 
way to date.

Another thing I like about haproxy is that it's unnervingly fast; 
start/stop/reload are effectively instantaneous. I often find that crm_mon and 
heartbeat services take ages to do anything, and it's never clear whether it's 
just taking a long time or if something's wrong.

I'm running heartbeat + pacemaker/crm at the moment. I've had a couple of 
attempts at migrating to corosync, but so far I've had no success and a great 
deal of confusion, even though all I'm doing is managing a single IP.

As Dmitri said, heartbeat has other strengths, especially when it comes to more 
complex clusters with multiple services and dependencies, and the power can't 
be denied!

Marcus


Re: [Linux-HA] Nodes not seeing each other

2012-07-09 Thread Marcus Bointon
On 27 Jun 2012, at 00:39, Andreas Kurz wrote:

 If the network is working as expected again, Heartbeat should reconnect
 automatically ... if not, restart Heartbeat if you are confident the
 network problem is solved.

I finally arranged for possible downtime to permit me to try this. I restarted 
heartbeat on one node and it fell offline. I rebooted it and it came back, but 
heartbeat returned to the same split-brain state where neither node could see 
the other.

After some rummaging I found what the problem was: an IPaddr2 resource had been 
configured using one node's primary static IP, which had been migrated to the 
other node, resulting in it falling offline, but making it look like it was up 
because it was pointing at the wrong node! Not pretty.

I then found I couldn't delete the incorrect ip resource as it refused to stop 
- is there some way to force stop/delete? Once I'd resolved that, I ran into 
problems getting pacemaker to start - heartbeat processes were ok, but not the 
pacemaker ones like cib. Some reboots and networking restarts eventually solved 
that.

This setup is running heartbeat 3.0.5 and pacemaker 1.1.6 from the 
ubuntu-ha-maintainers ppa. Is corosync generally more robust than heartbeat? 
Would it be worth upgrading to it?

Marcus


Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state

2012-07-03 Thread Marcus Bointon
On 3 Jul 2012, at 12:26, darren.mans...@opengi.co.uk 
darren.mans...@opengi.co.uk wrote:

 Out of interest Lars, why do you recommend XFS?

I'd second that. Percona has benchmarks for MySQL on XFS being in some cases 
twice as fast as ext3.

Marcus


Re: [Linux-HA] Heartbeat Failover Configuration Question

2012-04-23 Thread Marcus Bointon
On 23 Apr 2012, at 02:23, Net Warrior wrote:

 auto_failback on

No. As far as I'm aware this is to control what happens when your initial node 
recovers. If you have 2 nodes, a and b, and a is active, but then fails, b will 
take over, but when a is fixed and recovers, heartbeat will 'fail back' to a 
automatically if this property is on. You might want this if a is a 
faster/better server.
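
The fail-back behaviour described above can be traced with a tiny simulation. This is a sketch in Python rather than anything taken from heartbeat itself: the `simulate` function is hypothetical, and it models `auto_failback` as a simple on/off boolean for a two-node pair.

```python
def simulate(events, auto_failback, primary="a", backup="b"):
    """Return which node is active after each (node, "up" | "down") event."""
    up = {primary: True, backup: True}
    active = primary
    history = []
    for node, state in events:
        up[node] = state == "up"
        other = backup if active == primary else primary
        if not up[active] and up[other]:
            active = other    # failover: the active node died
        elif auto_failback and active != primary and up[primary]:
            active = primary  # failback: the primary has recovered
        history.append(active)
    return history


events = [("a", "down"), ("a", "up")]
with_failback = simulate(events, auto_failback=True)      # ['b', 'a']
without_failback = simulate(events, auto_failback=False)  # ['b', 'b']
print(with_failback, without_failback)
```

With the setting on, the resources return to a as soon as it recovers, which costs a second interruption; with it off they stay on b until something else moves them.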

Marcus


Re: [Linux-HA] o2cb Pacemaker Stack glue driver not loaded

2012-03-01 Thread Marcus Bointon
On 1 Mar 2012, at 10:25, Stefan Schloesser wrote:

 I would like load-balancing and use typo3 which writes upon access to the 
 filesystem and db (cache etc.). Still pointless? 

I do (well, I will be again when I get corosync/pacemaker working again!) 
something similar using a managed IP in front of haproxy/stunnel/apache with 
GlusterFS for a shared file system. Seems about as simple as I could make it - 
I don't see any point using pacemaker to manage haproxy/apache when it can all 
just happen behind the floating IP.

Marcus


Re: [Linux-HA] o2cb Pacemaker Stack glue driver not loaded

2012-03-01 Thread Marcus Bointon
On 1 Mar 2012, at 11:25, Stefan Schloesser wrote:

 My setup would involve 2 loadbalancer and 2 nodes. Are you saying that 
 running GlusterFs on both nodes using its replication feature is easier + 
 more reliable than DRBD + ocfs2 + pacemaker?

I can't compare reliability as I've never used DRBD, but gluster has worked 
fine for me for several years. Historically I've always found heartbeat etc 
very difficult to deal with, so I try to use it in as simple a way as possible, 
i.e. just managing a single IP.

 And you use the haproxy/stunnel to monitor availability of the nodes (apache) 
 ?

Yes, haproxy is pretty good at that and it works beautifully (and it has a nice 
status page too). My two nodes are set up identically with sysctl set to allow 
binding to non-local addresses so haproxy can be set to listen on the floating 
IP even when it's not on the local machine. stunnel is a very simple thing - 
it's just a pipe really. You could use pound instead (it has SSL integrated), 
but I prefer haproxy's config system. One key thing is that the servers don't 
have to DO anything at failover time - the software is all already up and 
running (and easily testable since it has its own IP), it just starts receiving 
traffic when it gets the floating IP. I happen to be running proxies and web 
servers on the same nodes, but you could split them up if you want - haproxy is 
extremely fast and uses almost no resources.
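
The post does not name the sysctl, but on Linux the setting that allows binding to an address the machine does not currently hold is, as far as I know, `net.ipv4.ip_nonlocal_bind`. Assuming that is the one meant, a quick check in Python might look like this; the function name is made up for illustration.

```python
from pathlib import Path

# Assumed location of the setting under /proc on Linux.
NONLOCAL_BIND = Path("/proc/sys/net/ipv4/ip_nonlocal_bind")


def nonlocal_bind_enabled(path=NONLOCAL_BIND):
    """Return True if the kernel allows binding to non-local IPv4 addresses."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        # File missing or unreadable (for example, not on Linux).
        return False


print(nonlocal_bind_enabled())
```

If this reports False on the standby node, a listener configured for the floating IP would be expected to fail to bind until the IP arrives.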

Marcus


[Linux-HA] libpthread segfaults

2012-02-29 Thread Marcus Bointon
I've scrapped my old heartbeat config and I'm trying to start from a clean slate 
with corosync/pacemaker installed on Ubuntu Lucid from the ubuntu-ha PPA 
(http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu). I'm running corosync 
1.2.0-0ubuntu1 and pacemaker 1.0.8+hg15494-2ubuntu2.

I have one server that is happy, but the other is segfaulting in libpthread in 
attrd and cib. Everything else on the server appears to be working ok.

There's a chunk of the log file here: http://pastie.org/3486981 Example 
segfaults:

Feb 29 09:40:27 www4 kernel: attrd[16632]: segfault at 8 ip 7f563a5970e8 sp 
7fff89a6a7b8 error 6 in libpthread-2.11.1.so[7f563a58a000+18000]
Feb 29 09:40:27 www4 kernel: cib[16630]: segfault at 8 ip 7f6425fe60e8 sp 
7fff31f29858 error 6 in libpthread-2.11.1.so[7f6425fd9000+18000]

I don't know how to get a stack trace of these as I don't know how these 
programs are started.

Is this a known problem?

Marcus


Re: [Linux-HA] libpthread segfaults

2012-02-29 Thread Marcus Bointon
On 29 Feb 2012, at 14:23, Florian Haas wrote:

 You should really run on Corosync 1.4.2+ and Pacemaker 1.1.5+. And
 that's what that PPA has. The versions you're running are pretty
 ancient. :)

Well since none of it's working, I have no problem throwing it all away and 
starting again!

Thanks very much,

Marcus


Re: [Linux-HA] libpthread segfaults

2012-02-29 Thread Marcus Bointon
On 29 Feb 2012, at 14:28, Marcus Bointon wrote:

 Well since none of it's working, I have no problem throwing it all away and 
 starting again!

My crashes have gone away, but I have other issues with the same server. The 
corosync service starts, and is found by the other node:


Last updated: Wed Feb 29 15:07:55 2012
Last change: Wed Feb 29 15:00:10 2012 via crmd on www5
Stack: openais
Current DC: www5 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.


Node www4: pending
Online: [ www5 ]

Running 'crm status' on www4 just gives Connection to cluster failed: 
connection failed. In the log I have these lines from cib:

Feb 29 15:00:18 www4 cib: [24712]: info: crm_log_init_worker: Changed active 
directory to /var/lib/heartbeat/cores/hacluster
Feb 29 15:00:18 www4 cib: [24712]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Feb 29 15:00:18 www4 cib: [24712]: WARN: retrieveCib: Cluster configuration not 
found: /var/lib/heartbeat/crm/cib.xml
Feb 29 15:00:18 www4 cib: [24712]: WARN: readCibXmlFile: Primary configuration 
corrupt or unusable, trying backup...
Feb 29 15:00:18 www4 cib: [24712]: WARN: readCibXmlFile: Continuing with an 
empty configuration.
Feb 29 15:00:18 www4 cib: [24712]: info: validate_with_relaxng: Creating RNG 
parser context
Feb 29 15:00:18 www4 corosync[24705]:   [pcmk  ] info: spawn_child: Forked 
child 24712 for process cib
Feb 29 15:00:18 www4 cib: [24712]: info: startCib: CIB Initialization completed 
successfully
Feb 29 15:00:18 www4 cib: [24712]: info: get_cluster_type: Cluster type is: 
'openais'
Feb 29 15:00:18 www4 cib: [24712]: notice: crm_cluster_connect: Connecting to 
cluster infrastructure: classic openais (with plugin)
Feb 29 15:00:18 www4 cib: [24712]: info: init_ais_connection_classic: Creating 
connection to our Corosync plugin
Feb 29 15:00:18 www4 cib: [24712]: info: init_ais_connection_classic: 
Connection to our AIS plugin (9) failed: Library error (2)
Feb 29 15:00:18 www4 cib: [24712]: CRIT: cib_init: Cannot sign in to the 
cluster... terminating

cib appears to be fine on www5. I've never touched anything in 
/var/lib/heartbeat/crm - this is a completely vanilla config, though it may be 
that there are remnants of the old heartbeat config (which was only on www4) 
causing this. Can I just copy the contents of that folder from the other server?

Marcus


Re: [Linux-HA] libpthread segfaults

2012-02-29 Thread Marcus Bointon
On 29 Feb 2012, at 17:43, Florian Haas wrote:

 My hunch is that you never properly shut down corosync on that one.
 Did you check your ps output to see if it was really down? Corosync
 1.2.x had some nasty shutdown issues when running with Pacemaker.

I shut down or killed anything vaguely related to 
corosync/crm/heartbeat/cib and restarted corosync and pacemaker.

Now on www4 I can see a pacemaker process with crmd, pengine, lrmd and stonithd 
child processes, and on www5 I see those plus attrd and cib (which curiously 
are the same processes that were reporting segfaults when I was running the old 
version). www4 is correspondingly still failing to connect to cib.

Starting corosync by itself appears to work correctly on both - the logs show 
they see each other, no errors.

If on www4 I start attrd and cib manually (as root), they do run, and crm then 
manages to connect but reports no nodes. crm on www5 sees www4, but it's marked 
as 'pending'. pcmk on www4 logs that it can see www5.

Marcus


Re: [Linux-HA] libpthread segfaults

2012-02-29 Thread Marcus Bointon
On 29 Feb 2012, at 21:03, Florian Haas wrote:

 And you're sure you've got a healthy Corosync membership?
 corosync-cfgtool -s shows all rings healthy? corosync-objctl | grep
 member shows 2 members?

I'm not sure what the output is supposed to look like, but it certainly gives 
the impression of being healthy:

On www4:
Printing ring status.
Local node ID 192961885
RING ID 0
id  = 192.168.0.11
status  = ring 0 active with no faults

On www5:
Printing ring status.
Local node ID 343956829
RING ID 0
id  = 192.168.0.148
status  = ring 0 active with no faults

Are they each meant to show both nodes here?

On both nodes:
runtime.totem.pg.mrp.srp.members.343956829.ip=r(0) ip(192.168.0.148) 
runtime.totem.pg.mrp.srp.members.343956829.join_count=1
runtime.totem.pg.mrp.srp.members.343956829.status=joined
runtime.totem.pg.mrp.srp.members.192961885.ip=r(0) ip(192.168.0.11) 
runtime.totem.pg.mrp.srp.members.192961885.join_count=1
runtime.totem.pg.mrp.srp.members.192961885.status=joined

But crm status gives this on www4 (this is still running my manually launched 
cib/attrd):


Last updated: Wed Feb 29 20:14:08 2012
Last change: Wed Feb 29 17:34:55 2012
Current DC: NONE
0 Nodes configured, unknown expected votes
0 Resources configured.


and this on www5


Last updated: Wed Feb 29 20:14:01 2012
Last change: Wed Feb 29 17:29:20 2012 via crmd on www5
Stack: openais
Current DC: www5 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.


Node www4: pending
Online: [ www5 ]

Any the wiser?
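
For anyone wanting to check the membership output mechanically, here is a minimal Python sketch. It assumes the corosync-objctl key format is exactly as pasted above; the function name is made up for illustration. It only confirms corosync-level membership, so it says nothing about why crm shows www4 as pending.

```python
import re

# Matches lines such as:
#   runtime.totem.pg.mrp.srp.members.343956829.status=joined
MEMBER_STATUS = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.status=(\S+)\s*$"
)


def joined_members(objctl_output):
    """Return the sorted node IDs whose status is 'joined'."""
    joined = []
    for line in objctl_output.splitlines():
        match = MEMBER_STATUS.match(line.strip())
        if match and match.group(2) == "joined":
            joined.append(int(match.group(1)))
    return sorted(joined)


# Sample copied from the output pasted above.
sample = """\
runtime.totem.pg.mrp.srp.members.343956829.ip=r(0) ip(192.168.0.148)
runtime.totem.pg.mrp.srp.members.343956829.join_count=1
runtime.totem.pg.mrp.srp.members.343956829.status=joined
runtime.totem.pg.mrp.srp.members.192961885.ip=r(0) ip(192.168.0.11)
runtime.totem.pg.mrp.srp.members.192961885.join_count=1
runtime.totem.pg.mrp.srp.members.192961885.status=joined
"""

members = joined_members(sample)
print(len(members), "joined members:", members)
```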

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info@hand CRM solutions
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/





[Linux-HA] MMM conflict with Pacemaker

2012-02-16 Thread Marcus Bointon
I have 5 servers where 2 are running a redundant web front-end with pacemaker 
(managing a single floating IP), two are running MySQL with mmm agents and the 
last one is running the mmm monitor node. So at present there is no overlap 
between these groups. I need to retire one of the web servers and its functions 
will be moved to the machine currently doing mmm monitoring. Easier said than 
done.

If I install pacemaker (from the linux-ha PPA for Lucid, with empty initial 
config, as per the docs) and start its corosync service, mmm's monitor goes 
nuts: it loses connectivity to the agents, which causes them to drop their 
floating IP (even though it's not on the machines involved with pacemaker). I 
can appreciate 
that there is some overlap in functionality, but I don't see why it should 
conflict like this. Anyone got an explanation? Is anyone else running this 
combo?

I've temporarily bypassed the front-end so I can work on this, so I'm clear to 
start entirely from scratch. This is proving difficult too, since the shifting 
terminology means documentation is mostly out of sync - of the three guides 
I've tried so far, one doesn't mention ha.cf at all (others do, but with 
obsolete options), and one suggests doing everything with corosync (though it 
appears to be missing any config for pacemaker). One thing that would be very helpful 
is something to explain the relative merits of ucast, bcast and mcast options, 
as I suspect they may be part of the problem I'm seeing with mmm.

(and I'm not looking to switch to DRBD!)

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info@hand CRM solutions
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/





Re: [Linux-HA] MMM conflict with Pacemaker

2012-02-16 Thread Marcus Bointon
On 16 Feb 2012, at 18:00, Mark Grennan wrote:

 Yes HA systems are very confusing.

It's not so much that - it's more that heartbeat/crm/pacemaker/corosync is 
confusing, not least because it keeps changing its name. Constant changing of 
names, nomenclature and config settings guarantees that any articles written 
about it won't work for long.

 Pacemaker is the name of an older application.  Corosync is its new name but 
 some of the files still maintain the old name.

Huh? So why does corosync need setting up to work with pacemaker if it is now 
pacemaker? Even your doc installs them (and heartbeat) from separate packages!

 One Issue I can think of is, Pacemaker wants to bind the floating IP as 
 eth#:#, while MMM wants to use a different method that can only be seen with 
 the IP command.   I think they are fighting over who owns the floating IP.

But pacemaker isn't even running on the machines the mmm float is on! It's 
somehow interfering with the monitoring node, not the float that it's managing. 
I don't have a problem with using the ip command - I was under the impression 
it's how things are supposed to be done now? I've seen mixtures of 
ifconfig-style network config coexisting quite happily with ip-style ones 
before.

My original config:

server1: pacemaker
server2: pacemaker
server3: mmm monitor
server4: mmm agent
server5: mmm agent

There is a floating IP on servers 1 and 2, and another one on servers 4 and 5.

What I want to change to:

server2: pacemaker
server3: pacemaker + mmm monitor
server4: mmm agent
server5: mmm agent

Here there is a floating IP on 2 and 3, and another on 4 and 5. I don't see any 
reason they should conflict since there is no overlap of machines that floats 
are on. What seems to happen is that as soon as corosync is started, the mmm 
monitor can no longer see the network at all. I suspect this could be something 
to do with the suggested setting of using the network address for bindnetaddr 
in corosync.
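
To be clear about what "the network address" means here: it is the interface address with the host bits masked off. A small Python sketch using the standard ipaddress module; the address 192.168.0.11 and the /24 prefix are hypothetical examples, not values from my config:

```python
import ipaddress


def network_address(host_ip, prefix_len):
    """Return the network address for a host IP and prefix length."""
    iface = ipaddress.ip_interface(f"{host_ip}/{prefix_len}")
    return str(iface.network.network_address)


# Hypothetical interface address and prefix length, for illustration only.
print(network_address("192.168.0.11", 24))
```

So for a host on 192.168.0.11/24, the value the guides suggest for bindnetaddr would be 192.168.0.0 rather than the host's own address.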

I'm still mystified by whether I should use ucast, mcast or bcast - previous 
setups I've done with crm have used ucast. I see in your example you're binding 
to a private IP for corosync, but I can't understand why you're using a public 
IP for mcast, or why it's even there at all.
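
As far as I can tell, the three options differ mainly in the kind of destination address the cluster traffic is sent to, and a multicast address comes from a reserved range (224.0.0.0/4) rather than being an ordinary public host address. A rough Python sketch to tell the three apart; 226.94.1.1 is assumed here because it is the multicast address commonly seen in example corosync configs, and the 192.168.0.0/24 network is hypothetical:

```python
import ipaddress


def classify(address, network):
    """Classify a destination as 'multicast', 'broadcast' or 'unicast'."""
    addr = ipaddress.ip_address(address)
    net = ipaddress.ip_network(network)
    if addr.is_multicast:
        return "multicast"
    if addr == net.broadcast_address:
        return "broadcast"
    return "unicast"


# Hypothetical local network; the multicast address is an assumed example.
LOCAL_NET = "192.168.0.0/24"
for destination in ("226.94.1.1", "192.168.0.255", "192.168.0.11"):
    print(destination, "->", classify(destination, LOCAL_NET))
```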

Your guide wasn't one of the ones I'd found, so thanks for the pointer. The 
most interesting one for me was this one, since it is closest to my own config 
and seems quite recent (i.e. it even mentions corosync): 
https://wiki.ubuntu.com/ClusterStack/LucidTesting

The official 'Clusters from Scratch' PDF skips over quite a few bits of vital 
info, so I found I couldn't really use it.

My mmm config was originally installed by Percona, and I've done several others 
since. mmm has always worked beautifully for me (even through multiple hardware 
and network failures), and the main complaint I've seen about it (1062 errors) 
is nothing to do with mmm. I fully understand that it has problems, however it 
has the advantage of being very stable and trivially easy to understand and 
configure. While I keep reading good things about pacemaker, the practical 
aspects of getting it to work have always turned into a yak-shaving festival, 
so I've always been put off pursuing it for anything beyond management of a 
single IP. One critical aspect of an HA system is that it should be really easy 
to deal with when things go wrong; I'd put xtrabackup in this category - it's 
great (though I hope you have automated tests for your restores as it went 
through a patch late last year when they were broken!).

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info@hand CRM solutions
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/





Re: [Linux-HA] Basic 2-node floating IP setup

2009-12-02 Thread Marcus Bointon
On 2 Dec 2009, at 15:18, David Lang wrote:

 I thought you needed a different node line for each server
 
 node www1
 node www2

Docs say you can: http://www.linux-ha.org/ha.cf#node

 unless you have a specific reason to allow the two IP addresses to exist on 
 two different machines you should put them both in the same line
 
 www1 192.158.1.3 192.158.1.4

OK.
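
For the record, the merge suggested above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes simple one-line entries (it does not handle backslash line continuations).

```python
from collections import OrderedDict


def merge_haresources(text):
    """Combine lines that start with the same node name into one line."""
    merged = OrderedDict()
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        node, resources = fields[0], fields[1:]
        merged.setdefault(node, []).extend(resources)
    return "\n".join(" ".join([node] + res) for node, res in merged.items())


# The two haresources lines from my original post.
original = "www1 192.158.1.3\nwww1 192.158.1.4\n"
print(merge_haresources(original))
```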

 I've added this to my /etc/sysctl.conf, which was apparently necessary to 
 allow the floating IPs to exist:
 net.ipv4.ip_nonlocal_bind=1
 
 this isn't needed to let floating IPs exist, but it is needed to let software 
 start up that wants to use these IP addresses when the box isn't active.

Ah, ok, I knew it was something like that.
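
A quick way to confirm the setting actually took effect is to read it back from procfs. A minimal Python sketch, assuming a Linux box where the sysctl is exposed at the usual /proc/sys path; the function name is illustrative:

```python
from pathlib import Path

# Standard procfs location of the sysctl on Linux.
NONLOCAL_BIND = Path("/proc/sys/net/ipv4/ip_nonlocal_bind")


def nonlocal_bind_enabled(path=NONLOCAL_BIND):
    """Return True if the sysctl file at `path` contains 1."""
    return Path(path).read_text().strip() == "1"


if __name__ == "__main__":
    if NONLOCAL_BIND.exists():
        print("ip_nonlocal_bind enabled:", nonlocal_bind_enabled())
    else:
        print("sysctl not available on this system")
```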

heartbeat didn't re-read the config with a simple restart; it needed a full 
stop/start. After doing this, both floating IPs are up and it seems to be 
working!

Thanks for your help,

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK resellers of i...@hand CRM solutions
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/




[Linux-HA] Basic 2-node floating IP setup

2009-12-01 Thread Marcus Bointon
I'm trying to set up HA on an ubuntu cluster. The docs are fairly 
comprehensive, but I still can't make sufficient sense out of them or find any 
sufficiently matching examples.

My scenario is pretty standard - two nodes running pound and HA managing two 
floating IPs between the two. I'm using the stock Ubuntu heartbeat 2.1.4 package with an 
HA 1.0 style config. The main thing I can't quite figure out is exactly what to 
do with the floating IPs. As far as I can see, the floating IPs are purely 
resources to be shared by my nodes and as such are not nodes themselves, and 
don't need to appear in ha.cf. Having said that, I have use of another HA 
config done by someone else that lists a floating IP as a node and ucast, and 
it works fine, which has me confused.

Say that my two nodes are 192.158.1.1 and 192.158.1.2, call them www1 and www2, 
and my floating IPs are 192.158.1.3 and 192.158.1.4.

So far my ha.cf looks like this:

node www1 www2
ucast eth0 192.158.1.1
ucast eth0 192.158.1.2
deadtime 5
deadping 5
debug 0

and haresources looks like this:

www1 192.158.1.3
www1 192.158.1.4

I've added this to my /etc/sysctl.conf, which was apparently necessary to allow 
the floating IPs to exist:
net.ipv4.ip_nonlocal_bind=1

Does all that look right? Anything I've missed? Do the floating IPs need to 
appear in ha.cf as ucast lines (like my other setup)? Is my other setup wrong?

Thanks,

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK resellers of i...@hand CRM solutions
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/

