Re: [Linux-HA] HA Summit Key-signing Party (was: Organizing HA Summit 2015)
I have some keybase.io invitations if anyone wants one.

Marcus
--
Marcus Bointon
Technical Director, Synchromedia Limited
Creators of http://www.smartmessages.net/
UK 1CRM solutions http://www.syniah.com/
mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] How many primitives, groups can I have
On 11 Nov 2013, at 13:57, Michael Brookhuis mimabr...@googlemail.com wrote:

> Is there a limit in the number of primitives, etc. you can have? What maximum number is recommended based on best practices? Are 1500 too many?

I think it depends on your transport layer. If you're using heartbeat, I think I ran into a problem where the whole set of resource definitions had to fit into one packet, which typically capped things at around 30. I think corosync removed that limit. I'm hazy on the details though.

Marcus
Re: [Linux-HA] The Heartbeat of learning
On 16 Oct 2013, at 16:44, Digimer li...@alteeve.ca wrote:

> Corosync uses the totem protocol for heartbeat-like monitoring of the...

Thank you for the clearest summary of the HA stack I've seen so far! It should be on the HA site somewhere (it probably is...!)

Marcus
Re: [Linux-HA] Need some help, Two node M/S IPAddr2 IP Failover, both nodes slaves
On 9 Aug 2013, at 06:29, Gary Mazzaferro ga...@oedata.com wrote:

> Additionally, the traffic to the nodes seems interleaved, random connection to node1/node2 from clients. And, when I shut down node2 or place it in standby, the VIP doesn't shift to node1, it appears that node1 is down.

This sounds like you may not have an ARP resource grouped with the IP, so switches are still sending traffic to whichever node they have cached. This is my usual config for managing a floating IP:

    primitive ip2 ocf:heartbeat:IPaddr2 params ip=x.x.x.x cidr_netmask=24 op monitor interval=10s nic=eth0
    primitive ip2arp ocf:heartbeat:SendArp params ip=x.x.x.x nic=eth0
    group proxyfloat2 ip2 ip2arp
    location cli-standby-proxyfloat2 proxyfloat2 rule $id=cli-standby-rule-proxyfloat2 -inf: #uname eq proxy1 and #uname eq proxy2
    colocation ip_with_arp2 inf: ip2 ip2arp
    order arp_after_ip2 inf: ip2:start ip2arp:start

Change your IPs and node names as appropriate.

Marcus
Re: [Linux-HA] MailTo resources, 'message too long' errors
On 6 Aug 2013, at 11:10, Dejan Muhamedagic deja...@fastmail.fm wrote:

> Which compression setting do you have now? I think you should try with different compression settings as suggested there by Lars.

I had no compression set at all. I added the settings as advised in that posting and it does seem to have solved the problem for now, though clearly I need to move to corosync ASAP before my CIB gets big enough to break again!

Any idea on how to improve email notifications? At the moment the only notifications I get out of the cluster amount to 'something happened'.

Marcus
[Linux-HA] MailTo resources, 'message too long' errors
I have two nodes running heartbeat 3.0.5 and pacemaker 1.1.6 (both from the linux-ha lucid ppa). They are running 11 groups, each comprising an ocf:heartbeat:IPaddr2, an ocf:heartbeat:SendArp and an ocf:heartbeat:MailTo. There is also a mailto resource configured for the overall cluster. Despite all these, all the notifications I ever receive look identical:

    Heartbeat status change: Migrating resource away at Mon Aug 5 13:01:49 UTC 2013 from proxy2
    Command line was: /usr/lib/ocf/resource.d//heartbeat/MailTo stop

One major omission here is that it doesn't tell me which resource it migrated. Is there some way of configuring the cluster itself to send notifications so that I can remove the individual mailto resources?

Coincidentally (?), I've just started to get this problem:

    Aug 5 11:13:50 proxy1 heartbeat: [2958]: ERROR: glib: ucast_write: Unable to send HBcomm packet eth0 192.168.1.10:694 len=78903 [-1]: Message too long
    Aug 5 11:13:50 proxy1 heartbeat: [2958]: ERROR: write_child: write failure on ucast eth0.: Message too long

This (well at least I assume it's this) is resulting in resources disappearing, randomly starting and stopping, flip-flopping between nodes, marking nodes as offline and more fun things to keep us awake at night. The only explanation I've found for this is here: http://comments.gmane.org/gmane.linux.highavailability.pacemaker/10765

The solutions suggested are to alter compression settings (which we were not using before), migrate to corosync and/or make the CIB smaller, hence the idea of removing the individual mailtos. I've run hb_report and it doesn't say anything useful; more or less, it doesn't work.

I'd like to migrate to corosync if it's better, but I'm extremely wary of touching anything in the cluster.

Marcus
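[Editorial sketch] The "Message too long" error above is reported for a packet with len=78903. One plausible reading, offered here as an assumption rather than a statement about heartbeat's internals, is that the packet simply cannot fit into a single UDP datagram, whose payload over IPv4 tops out at 65507 bytes. The helper name `hbcomm_packet_len` is made up for this illustration; the log line is the one quoted in the message.

```python
import re

# Largest payload one UDP datagram can carry over IPv4:
# 65535 (maximum IP packet) - 20 (minimal IP header) - 8 (UDP header).
MAX_UDP_PAYLOAD_IPV4 = 65535 - 20 - 8


def hbcomm_packet_len(log_line):
    """Return the len= value from a heartbeat ucast_write error line, or None.

    Hypothetical helper for this sketch; it is not part of heartbeat.
    """
    match = re.search(r"\blen=(\d+)\b", log_line)
    return int(match.group(1)) if match else None


line = (
    "Aug 5 11:13:50 proxy1 heartbeat: [2958]: ERROR: glib: ucast_write: "
    "Unable to send HBcomm packet eth0 192.168.1.10:694 len=78903 [-1]: "
    "Message too long"
)

size = hbcomm_packet_len(line)
overshoot = size - MAX_UDP_PAYLOAD_IPV4
print(f"packet is {size} bytes, {overshoot} bytes over the "
      f"{MAX_UDP_PAYLOAD_IPV4}-byte UDP payload limit")
```

If that reading is right, it fits the suggested remedies in the thread: compression and a smaller CIB both shrink the packet, while a different transport avoids the single-datagram constraint.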
Re: [Linux-HA] Resource moves
On 19 Apr 2013, at 14:48, Florian Crouzat gen...@floriancrouzat.net wrote:

> Well, you kinda answered this when you mentioned crm_resource -U. You should use unmove instead of move. Unmove will remove the location constraint where move will create a new one.

Thanks, that sounds much better - but it has a slightly annoying effect. I normally have a location rule like this:

    location cli-prefer-ip3 ip3 rule $id=cli-prefer-rule-ip3 inf: #uname eq proxy1

When I issue a move, it gets removed and replaced with:

    location cli-standby-ip3 ip3 \
        rule $id=cli-standby-rule-ip3 -inf: #uname eq proxy1

When I unmove, it deletes that location rule, but because the preference rule has been removed, it doesn't result in the IP moving back: there is no longer any preferred node, so no incentive for it to do so. This seems to rather defeat the point of unmove. Is there a move/unmove that doesn't apply location rules, but just tells the resource to move?

Marcus
Re: [Linux-HA] Resource moves
On 19 Apr 2013, at 15:37, Florian Crouzat gen...@floriancrouzat.net wrote:

> Maybe you should start using different CIBs, each one of them containing a certain set of location constraints.

I wouldn't know where to start with that - but is there really no way to tell a resource to go to a particular node without creating persistent location rules?

Marcus
Re: [Linux-HA] Resource moves
On 19 Apr 2013, at 14:48, Florian Crouzat gen...@floriancrouzat.net wrote:

> About move and constraints, it all goes down to a design choice, but to me, it makes sense (for the reasons I mentioned in my first answer) and it's documented, so ... :) When you, as an admin, say otherwise, the cluster trusts you and creates a location constraint representing the administrative decision you just took.

While I am obviously the ultimate decider of what goes where, this mechanism doesn't allow me to separate these intentions:

* Move resource x to node 1 now
* Move resource x to node 1 now and never allow it to come back

As far as I can see only the latter is possible, if I'm to believe the "This will be the case even if node 1 is the last node in the cluster" warning. It's obviously possible to have a resource sitting on a node with no applicable location rules, and yet it stays put. Since I can create location rules that may result in implicit moves, and I can issue move commands too, it doesn't seem necessary that the two should be tied together.

I think the most practical solution is to always follow a move with an unmove - though it's pretty counter-intuitive and clumsy, kind of like trying to drive a car by issuing written instructions...

Marcus
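[Editorial sketch] The move/unmove behaviour described in this thread can be illustrated with a deliberately tiny model. This is not Pacemaker's policy engine: the `place` function, its "stay put on a tie" rule, and the way scores are summed are all assumptions made for the illustration. Only the constraint names and node names come from the messages above.

```python
INF = float("inf")


def place(current, nodes, constraints):
    """Toy placement rule (an assumption, not Pacemaker's real algorithm).

    Sum each node's location scores, rule out nodes scored -INFINITY,
    and keep the resource where it is unless another node scores higher.
    """
    scores = {node: 0.0 for node in nodes}
    for rule in constraints.values():
        for node, score in rule.items():
            scores[node] += score
    allowed = [node for node in nodes if scores[node] != -INF]
    if not allowed:
        return None
    best = max(allowed, key=lambda node: scores[node])
    if current in allowed and scores[current] >= scores[best]:
        return current  # nothing scores higher, so no reason to move
    return best


nodes = ["proxy1", "proxy2"]

# Normal state: a preference rule pins ip3 to proxy1.
constraints = {"cli-prefer-ip3": {"proxy1": INF}}
start = place(None, nodes, constraints)

# "move": the preference is replaced by a -INFINITY ban on proxy1.
del constraints["cli-prefer-ip3"]
constraints["cli-standby-ip3"] = {"proxy1": -INF}
after_move = place(start, nodes, constraints)

# "unmove": the ban is deleted, but the old preference is not restored.
del constraints["cli-standby-ip3"]
after_unmove = place(after_move, nodes, constraints)

print(start, after_move, after_unmove)
```

In this model the resource ends up on proxy2 and stays there after the unmove, because with no constraints left every node scores the same. That matches the observation that unmove alone does not bring the IP back.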
Re: [Linux-HA] Antw: Resource moves
On 19 Apr 2013, at 16:41, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

> Let me comment: crm resource migrate prm_yours PT2M will make a constraint that will stay in your CIB forever also, but it's active only for 2 minutes.

Where does the 2 minutes come from? As far as I can see they stick around until you delete them?

Marcus
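[Editorial sketch] On the "where does the 2 minutes come from" question: PT2M reads as an ISO 8601 duration, where P starts a period, T starts the time part, and 2M is two minutes. The parser below is a minimal illustration covering only the PT...H...M...S form; `parse_time_duration` is a name invented for this sketch and says nothing about how the crm shell parses its lifetime argument.

```python
import re
from datetime import timedelta

# Only the time part of an ISO 8601 duration: PT[nH][nM][nS].
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_time_duration(text):
    """Parse durations such as 'PT2M' or 'PT1H30M' into a timedelta.

    Hypothetical helper; date components (days, months, years) are
    deliberately not handled.
    """
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


lifetime = parse_time_duration("PT2M")
print(lifetime.total_seconds())
```

So in the quoted command the lifetime is the final argument, and the constraint itself stays in the CIB as described while only being active for that period.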
[Linux-HA] Resource move not moving
I'm running crm using heartbeat 3.0.5 and pacemaker 1.1.6 on Ubuntu Lucid 64. I have a small resource group containing an IP, ARP and email notifier on a cluster containing two nodes called proxy1 and proxy2. I asked it to move nodes, and it seems to say that was ok, but it hasn't actually moved, and crm_mon still shows it on the original node.

    # crm resource move proxyfloat3
    WARNING: Creating rsc_location constraint 'cli-standby-proxyfloat3' with a score of -INFINITY for resource proxyfloat3 on proxy1.
    This will prevent proxyfloat3 from running on proxy1 until the constraint is removed using the 'crm_resource -U' command or manually with cibadmin
    This will be the case even if proxy1 is the last node in the cluster
    This message can be disabled with -Q

This was in syslog:

    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib_process_request: Operation complete: op cib_delete for section constraints (origin=local/crm_resource/3, version=0.57.2): ok (rc=0)
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: - cib admin_epoch=0 epoch=57 num_updates=2 /
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + cib validate-with=pacemaker-1.0 crm_feature_set=3.0.5 have-quorum=1 admin_epoch=0 epoch=58 num_updates=1 cib-last-written=Tue Apr 16 08:52:01 2013 dc-uuid=68890308-615b-4b28-bb8b-5aa00bdbf65c
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + configuration
    Apr 16 13:32:35 proxy1 crmd: [2952]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.58.1) : Non-status change
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + constraints
    Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + rsc_location id=cli-standby-proxyfloat3 rsc=proxyfloat3
    Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + rule id=cli-standby-rule-proxyfloat3 score=-INFINITY boolean-op=and
    Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_pe_invoke: Query 150: Requesting the current CIB: S_POLICY_ENGINE
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + expression id=cli-standby-expr-proxyfloat3 attribute=#uname operation=eq value=proxy1 type=string __crm_diff_marker__=added:top /
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /rule
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /rsc_location
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /constraints
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /configuration
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib:diff: + /cib
    Apr 16 13:32:35 proxy1 cib: [2948]: info: cib_process_request: Operation complete: op cib_modify for section constraints (origin=local/crm_resource/4, version=0.58.1): ok (rc=0)

Yet crm status still shows:

    Resource Group: proxyfloat3
        ip3 (ocf::heartbeat:IPaddr2): Started proxy1
        ip3arp (ocf::heartbeat:SendArp): Started proxy1
        ip3email (ocf::heartbeat:MailTo): Started proxy1

So if all that's true, why is that resource group still on the original node? Is there something else I need to do?

Marcus
Re: [Linux-HA] Resource move not moving
    cause=C_FSA_INTERNAL origin=notify_crmd ]
    Apr 16 13:32:35 proxy1 crmd: [2952]: info: do_state_transition: Starting PEngine Recheck Timer
    Apr 16 13:32:35 proxy1 pengine: [28796]: notice: process_pe_message: Transition 25: PEngine Input stored in: /var/lib/pengine/pe-input-127.bz2
    Apr 16 13:33:17 proxy1 cib: [2948]: info: cib_stats: Processed 5 operations (6000.00us average, 0% utilization) in the last 10min

There's a lot there, but nothing that clearly says "I moved proxyfloat3", or more to the point, "I didn't"...

Marcus
[Linux-HA] PSU tip
This is a little something I ran into a while ago, and it occurred to me it might be of interest to anyone doing fencing or other power control operations.

Many servers have a BIOS option for what to do after a power failure. These are usually turn on, stay off, or 'auto', to return to whichever state it was in before. I set mine to auto, which led to a problematic situation! The server was not needed for a while so I did a normal power-off shutdown procedure, then turned off the server's PSU via a remote controlled PSU. Later I came to turn it back on and found I was stuck! Because it was in auto mode and had been shut down correctly, it was stuck turned off - turning on the PSU had no effect, so I had to call out an engineer to go and press the power button.

The moral of the story is to leave the BIOS set to 'turn on' and then turn it off at the PSU; do not use the 'auto' mode! This particular server had no IPMI/ILO facility so I couldn't tell it to turn on that way.

Marcus
Re: [Linux-HA] Ubuntu precise repo?
On 10 Dec 2012, at 02:25, Andrew Beekhof and...@beekhof.net wrote:

> It's part of the distro, no need to add anything: http://clusterlabs.org/quickstart-ubuntu.html

Excellent! Thanks,

Marcus
[Linux-HA] Ubuntu precise repo?
Is there an HA repo for Ubuntu Precise? https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa doesn't go that far.

Marcus
Re: [Linux-HA] Leave Apache running on both active and passive nodes?
On 21 Aug 2012, at 22:06, David Lang david_l...@intuit.com wrote:

> One legitimate reason for doing this is that you can then have heartbeat 'monitor' the webserver and if the webserver dies, initiate a failover. However I think this is better done by having a dummy service that takes no time to start/stop and implements its status with a file and then have some other, more extensive monitoring system checking your web front end (checking that it actually works, not just that apache is running) and altering the status file that heartbeat checks. Or you can have your monitoring software send a message to heartbeat to trigger a failover.

Well, haproxy does all that out of the box, no tricks or tweakery required. For monitoring services within a single server, I'm finding monit works well. If a web server fails, haproxy will see that (from outside) and stop sending it traffic, and monit (on the server) will give it a kick and send appropriate notifications. That setup has coped with most of the problems that have come my way to date.

Another thing I like about haproxy is that it's unnervingly fast; start/stop/reload are effectively instantaneous. I often find that crm_mon and heartbeat services take ages to do anything, and it's never clear whether it's just taking a long time or if something's wrong.

I'm running heartbeat + pacemaker/crm at the moment. I've had a couple of attempts at migrating to corosync, but so far I've had no success and a great deal of confusion, even though all I'm doing is managing a single IP. As Dmitri said, heartbeat has other strengths, especially when it comes to more complex clusters with multiple services and dependencies, and the power can't be denied!

Marcus
Re: [Linux-HA] Nodes not seeing each other
On 27 Jun 2012, at 00:39, Andreas Kurz wrote:

> If the network is working as expected again, Heartbeat should reconnect automatically ... if not, restart Heartbeat if you are confident the network problem is solved.

I finally arranged for possible downtime to permit me to try this. I restarted heartbeat on one node and it fell offline. I rebooted it and it came back, but heartbeat returned to the same split-brain state where neither node could see the other.

After some rummaging I found what the problem was: an IPaddr2 resource had been configured using one node's primary static IP. That IP had been migrated to the other node, so the first node fell offline while still appearing to be up, because its address was now answering from the wrong node! Not pretty.

I then found I couldn't delete the incorrect IP resource as it refused to stop - is there some way to force stop/delete? Once I'd resolved that, I ran into problems getting pacemaker to start - heartbeat processes were ok, but not the pacemaker ones like cib. Some reboots and networking restarts eventually solved that.

This setup is running heartbeat 3.0.5 and pacemaker 1.1.6 from the ubuntu-ha-maintainers ppa. Is corosync generally more robust than heartbeat? Would it be worth upgrading to it?

Marcus
Re: [Linux-HA] Antw: Re: mount.ocfs2 in D state
On 3 Jul 2012, at 12:26, darren.mans...@opengi.co.uk wrote:

> Out of interest Lars, why do you recommend XFS?

I'd second that. Percona has benchmarks for MySQL on XFS being in some cases twice as fast as ext3.

Marcus
Re: [Linux-HA] Heartbeat Failover Configuration Question
On 23 Apr 2012, at 02:23, Net Warrior wrote:

> auto_failback on

No. As far as I'm aware this is to control what happens when your initial node recovers. If you have 2 nodes, a and b, and a is active, but then fails, b will take over, but when a is fixed and recovers, heartbeat will 'fail back' to a automatically if this property is on. You might want this if a is a faster/better server.

Marcus
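[Editorial sketch] The fail-back behaviour described above can be walked through with a toy two-node simulation. This is only a model of the explanation in the message, not heartbeat code: the `simulate` function, the event format, and the choice of node a as the preferred node are assumptions for the illustration.

```python
def simulate(events, auto_failback):
    """Return which node is active after each (node, is_up) event.

    Toy model: node "a" is the preferred node and starts out active.
    """
    up = {"a": True, "b": True}
    active = "a"
    history = []
    for node, is_up in events:
        up[node] = is_up
        if active is None or not up[active]:
            # Fail over to any node that is still up.
            active = next((n for n in ("a", "b") if up[n]), None)
        elif auto_failback and up["a"]:
            # The preferred node is healthy again: hand the service back.
            active = "a"
        history.append(active)
    return history


# Node a fails, then recovers.
events = [("a", False), ("a", True)]
with_failback = simulate(events, auto_failback=True)
without_failback = simulate(events, auto_failback=False)
print(with_failback, without_failback)
```

With fail-back on, the service returns to a once it recovers; with it off, it stays on b until something else forces a change.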
Re: [Linux-HA] o2cb Pacemaker Stack glue driver not loaded
On 1 Mar 2012, at 10:25, Stefan Schloesser wrote:

> I would like load-balancing and use typo3 which writes upon access to the filesystem and db (cache etc.). Still pointless?

I do (well, I will be again when I get corosync/pacemaker working again!) something similar using a managed IP in front of haproxy/stunnel/apache with GlusterFS for a shared file system. Seems about as simple as I could make it - I don't see any point using pacemaker to manage haproxy/apache when it can all just happen behind the floating IP.

Marcus
Re: [Linux-HA] o2cb Pacemaker Stack glue driver not loaded
On 1 Mar 2012, at 11:25, Stefan Schloesser wrote:

> My setup would involve 2 loadbalancer and 2 nodes. Are you saying that running GlusterFs on both nodes using its replication feature is easier + more reliable than DRBD + ocfs2 + pacemaker?

I can't compare reliability as I've never used DRBD, but gluster has worked fine for me for several years. Historically I've always found heartbeat etc very difficult to deal with, so I try to use it in as simple a way as possible, i.e. just managing a single IP.

> And you use the haproxy/stunnel to monitor availability of the nodes (apache) ?

Yes, haproxy is pretty good at that and it works beautifully (and it has a nice status page too). My two nodes are set up identically with sysctl set to allow binding to non-local addresses so haproxy can be set to listen on the floating IP even when it's not on the local machine. stunnel is a very simple thing - it's just a pipe really. You could use pound instead (it has SSL integrated), but I prefer haproxy's config system.

One key thing is that the servers don't have to DO anything at failover time - the software is all already up and running (and easily testable since it has its own IP), it just starts receiving traffic when it gets the floating IP. I happen to be running proxies and web servers on the same nodes, but you could split them up if you want - haproxy is extremely fast and uses almost no resources.

Marcus
[Linux-HA] libpthread segfaults
I've scrapped my old heartbeat config and I'm trying to start from a clean slate with corosync/pacemaker installed on Ubuntu Lucid from the ubuntu-ha PPA (http://ppa.launchpad.net/ubuntu-ha/ppa/ubuntu). I'm running corosync 1.2.0-0ubuntu1 and pacemaker 1.0.8+hg15494-2ubuntu2.

I have one server that is happy, but the other is segfaulting in libpthread in attrd and cib. Everything else on the server appears to be working ok. There's a chunk of the log file here: http://pastie.org/3486981

Example segfaults:

    Feb 29 09:40:27 www4 kernel: attrd[16632]: segfault at 8 ip 7f563a5970e8 sp 7fff89a6a7b8 error 6 in libpthread-2.11.1.so[7f563a58a000+18000]
    Feb 29 09:40:27 www4 kernel: cib[16630]: segfault at 8 ip 7f6425fe60e8 sp 7fff31f29858 error 6 in libpthread-2.11.1.so[7f6425fd9000+18000]

I don't know how to get a stack trace of these as I don't know how these programs are started. Is this a known problem?

Marcus
Re: [Linux-HA] libpthread segfaults
On 29 Feb 2012, at 14:23, Florian Haas wrote:

> You should really run on Corosync 1.4.2+ and Pacemaker 1.1.5+. And that's what that PPA has. The versions you're running are pretty ancient. :)

Well since none of it's working, I have no problem throwing it all away and starting again!

Thanks very much,

Marcus
Re: [Linux-HA] libpthread segfaults
On 29 Feb 2012, at 14:28, Marcus Bointon wrote:

> Well since none of it's working, I have no problem throwing it all away and starting again!

My crashes have gone away, but I have other issues with the same server. The corosync service starts, and is found by the other node:

    Last updated: Wed Feb 29 15:07:55 2012
    Last change: Wed Feb 29 15:00:10 2012 via crmd on www5
    Stack: openais
    Current DC: www5 - partition with quorum
    Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
    2 Nodes configured, 2 expected votes
    0 Resources configured.
    Node www4: pending
    Online: [ www5 ]

Running 'crm status' on www4 just gives "Connection to cluster failed: connection failed". In the log I have these lines from cib:

    Feb 29 15:00:18 www4 cib: [24712]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
    Feb 29 15:00:18 www4 cib: [24712]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
    Feb 29 15:00:18 www4 cib: [24712]: WARN: retrieveCib: Cluster configuration not found: /var/lib/heartbeat/crm/cib.xml
    Feb 29 15:00:18 www4 cib: [24712]: WARN: readCibXmlFile: Primary configuration corrupt or unusable, trying backup...
    Feb 29 15:00:18 www4 cib: [24712]: WARN: readCibXmlFile: Continuing with an empty configuration.
    Feb 29 15:00:18 www4 cib: [24712]: info: validate_with_relaxng: Creating RNG parser context
    Feb 29 15:00:18 www4 corosync[24705]: [pcmk ] info: spawn_child: Forked child 24712 for process cib
    Feb 29 15:00:18 www4 cib: [24712]: info: startCib: CIB Initialization completed successfully
    Feb 29 15:00:18 www4 cib: [24712]: info: get_cluster_type: Cluster type is: 'openais'
    Feb 29 15:00:18 www4 cib: [24712]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
    Feb 29 15:00:18 www4 cib: [24712]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
    Feb 29 15:00:18 www4 cib: [24712]: info: init_ais_connection_classic: Connection to our AIS plugin (9) failed: Library error (2)
    Feb 29 15:00:18 www4 cib: [24712]: CRIT: cib_init: Cannot sign in to the cluster... terminating

cib appears to be fine on www5. I've never touched anything in /var/lib/heartbeat/crm - this is a completely vanilla config, though it may be that there are remnants of the old heartbeat config (which was only on www4) causing this. Can I just copy the contents of that folder from the other server?

Marcus
Re: [Linux-HA] libpthread segfaults
On 29 Feb 2012, at 17:43, Florian Haas wrote:

> My hunch is that you never properly shut down corosync on that one. Did you check your ps output to see if it was really down? Corosync 1.2.x had some nasty shutdown issues when running with Pacemaker.

I shut down or killed anything vaguely related to corosync/crm/heartbeat/cib and restarted corosync and pacemaker. Now on www4 I can see a pacemaker process with crmd, pengine, lrmd and stonithd child processes, and on www5 I see those plus attrd and cib (which curiously are the same processes that were reporting segfaults when I was running the old version). www4 is correspondingly still failing to connect to cib.

Starting corosync by itself appears to work correctly on both - the logs show they see each other, no errors. If on www4 I start attrd and cib manually (as root), they do run, and crm then manages to connect but reports no nodes. crm on www5 sees www4, but it's marked as 'pending'. pcmk on www4 logs that it can see www5.

Marcus
Re: [Linux-HA] libpthread segfaults
On 29 Feb 2012, at 21:03, Florian Haas wrote:

> And you're sure you've got a healthy Corosync membership? corosync-cfgtool -s shows all rings healthy? corosync-objctl | grep member shows 2 members?

I'm not sure what the output is supposed to look like, but it certainly gives the impression of being healthy.

On www4:

    Printing ring status.
    Local node ID 192961885
    RING ID 0
        id = 192.168.0.11
        status = ring 0 active with no faults

On www5:

    Printing ring status.
    Local node ID 343956829
    RING ID 0
        id = 192.168.0.148
        status = ring 0 active with no faults

Are they each meant to show both nodes here?

On both nodes:

    runtime.totem.pg.mrp.srp.members.343956829.ip=r(0) ip(192.168.0.148)
    runtime.totem.pg.mrp.srp.members.343956829.join_count=1
    runtime.totem.pg.mrp.srp.members.343956829.status=joined
    runtime.totem.pg.mrp.srp.members.192961885.ip=r(0) ip(192.168.0.11)
    runtime.totem.pg.mrp.srp.members.192961885.join_count=1
    runtime.totem.pg.mrp.srp.members.192961885.status=joined

But crm status gives this on www4 (this is still running my manually launched cib/attrd):

    Last updated: Wed Feb 29 20:14:08 2012
    Last change: Wed Feb 29 17:34:55 2012
    Current DC: NONE
    0 Nodes configured, unknown expected votes
    0 Resources configured.

and this on www5:

    Last updated: Wed Feb 29 20:14:01 2012
    Last change: Wed Feb 29 17:29:20 2012 via crmd on www5
    Stack: openais
    Current DC: www5 - partition with quorum
    Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
    2 Nodes configured, 2 expected votes
    0 Resources configured.
    Node www4: pending
    Online: [ www5 ]

Any the wiser?

Marcus
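[Editorial sketch] The "shows 2 members?" check can be done mechanically on the corosync-objctl lines quoted in the message. The snippet below only parses text in the format shown above; `joined_members` is a hypothetical helper, and whether other corosync versions print the same key names is an assumption.

```python
import re

# Matches lines such as:
# runtime.totem.pg.mrp.srp.members.343956829.status=joined
MEMBER_STATUS = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.status=(\w+)$"
)


def joined_members(objctl_output):
    """Return the sorted node IDs whose status line reads 'joined'."""
    joined = []
    for line in objctl_output.splitlines():
        match = MEMBER_STATUS.match(line.strip())
        if match and match.group(2) == "joined":
            joined.append(int(match.group(1)))
    return sorted(joined)


# Output quoted in the message, as seen on both nodes.
sample = """\
runtime.totem.pg.mrp.srp.members.343956829.ip=r(0) ip(192.168.0.148)
runtime.totem.pg.mrp.srp.members.343956829.join_count=1
runtime.totem.pg.mrp.srp.members.343956829.status=joined
runtime.totem.pg.mrp.srp.members.192961885.ip=r(0) ip(192.168.0.11)
runtime.totem.pg.mrp.srp.members.192961885.join_count=1
runtime.totem.pg.mrp.srp.members.192961885.status=joined
"""

members = joined_members(sample)
print(len(members), members)
```

Both node IDs come back as joined, which supports the impression that the corosync membership layer is healthy and that the 'pending' state originates higher up the stack.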
[Linux-HA] MMM conflict with Pacemaker
I have 5 servers: two are running a redundant web front-end with pacemaker (managing a single floating IP), two are running MySQL with mmm agents, and the last one is running the mmm monitor node. So at present there is no overlap between these groups. I need to retire one of the web servers, and its functions will be moved to the machine currently doing mmm monitoring.

Easier said than done. If I install pacemaker (from the linux-ha PPA for Lucid, with empty initial config, as per the docs) and start its corosync service, mmm's monitor goes nuts and loses connectivity to its agents, causing them to drop their floating IP (even though it's not on the machines involved with pacemaker). I can appreciate that there is some overlap in functionality, but I don't see why it should conflict like this. Anyone got an explanation? Is anyone else running this combo?

I've temporarily bypassed the front-end so I can work on this, so I'm clear to start entirely from scratch. This is proving difficult too, since the shifting terminology means documentation is mostly out of sync. Of the three guides I've tried so far, one doesn't mention ha.cf at all (the others do, but with obsolete options), and one suggests doing everything with corosync (though it appears to be missing any config for pacemaker).

One thing that would be very helpful is something to explain the relative merits of the ucast, bcast and mcast options, as I suspect they may be part of the problem I'm seeing with mmm. (And I'm not looking to switch to DRBD!)

Marcus
Re: [Linux-HA] MMM conflict with Pacemaker
On 16 Feb 2012, at 18:00, Mark Grennan wrote:

> Yes HA systems are very confusing.

It's not so much that - it's more that heartbeat/crm/pacemaker/corosync is confusing, not least because it keeps changing its name. Constant changing of names, nomenclature and config settings guarantees that any articles written about it won't work for long.

> Pacemaker is the name of an older application. Corosync is its new name but some of the files still maintain the old name.

Huh? So why does corosync need setting up to work with pacemaker if it is now pacemaker? Even your doc installs them (and heartbeat) from separate packages!

> One issue I can think of is, Pacemaker wants to bind the floating IP as eth#:#, while MMM wants to use a different method that can only be seen with the ip command. I think they are fighting over who owns the floating IP.

But pacemaker isn't even running on the machines the mmm float is on! It's somehow interfering with the monitoring node, not the float that it's managing. I don't have a problem with using the ip command - I was under the impression it's how things are supposed to be done now? I've seen mixtures of ifconfig-style network config coexisting quite happily with ip-style ones before.

My original config:

server1: pacemaker
server2: pacemaker
server3: mmm monitor
server4: mmm agent
server5: mmm agent

There is a floating IP on servers 1 and 2, and another one on servers 4 and 5.

What I want to change to:

server2: pacemaker
server3: pacemaker + mmm monitor
server4: mmm agent
server5: mmm agent

Here there is a floating IP on 2 and 3, and another on 4 and 5. I don't see any reason they should conflict, since there is no overlap in the machines that the floats are on. What seems to happen is that as soon as corosync is started, the mmm monitor can no longer see the network at all. I suspect this could be something to do with the suggested setting of using the network address for bindnetaddr in corosync.

I'm still mystified by whether I should use ucast, mcast or bcast - previous setups I've done with crm have used ucast. I see in your example you're binding to a private IP for corosync, but I can't understand why you're using a public IP for mcast, or why it's even there at all.

Your guide wasn't one of the ones I'd found, so thanks for the pointer. The most interesting one for me was this one, since it is closest to my own config and seems quite recent (i.e. it even mentions corosync): https://wiki.ubuntu.com/ClusterStack/LucidTesting

The official 'cluster from scratch' PDF skips over quite a few bits of vital info, so I found I couldn't really use it.

My mmm config was originally installed by Percona, and I've done several others since. mmm has always worked beautifully for me (even through multiple hardware and network failures), and the main complaint I've seen about it (1062 errors) is nothing to do with mmm. I fully understand that it has problems, but it has the advantage of being very stable and trivially easy to understand and configure. While I keep reading good things about pacemaker, the practical aspects of getting it to work have always turned into a yak-shaving festival, so I've always been put off pursuing it for anything beyond management of a single IP.

One critical aspect of an HA system is that it should be really easy to deal with when things go wrong; I'd put xtrabackup in this category - it's great (though I hope you have automated tests for your restores, as it went through a rough patch late last year when restores were broken!).

Marcus
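On the bindnetaddr point: the "network address" is just the interface address with the host bits zeroed, so it is easy to work out for a given node. A minimal sketch using Python's standard `ipaddress` module follows; the helper name `bindnetaddr_for` and the example address and /24 prefix are assumptions for illustration, not values taken from this cluster.

```python
import ipaddress


def bindnetaddr_for(host_ip, prefix_len):
    """Return the network address for an interface IP and prefix length.

    This is the address with all host bits cleared, which is what guides
    mean by "use the network address" for bindnetaddr.
    """
    iface = ipaddress.ip_interface(f"{host_ip}/{prefix_len}")
    return str(iface.network.network_address)


if __name__ == "__main__":
    # Hypothetical node address on an assumed /24 network.
    print(bindnetaddr_for("192.168.0.11", 24))
```

If the machine has several interfaces, the value has to be computed from the address on the interface intended for cluster traffic. Which interface that should be here is not something this sketch can decide.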
Re: [Linux-HA] Basic 2-node floating IP setup
On 2 Dec 2009, at 15:18, David Lang wrote:

> I thought you needed a different node line for each server
>
> node www1
> node www2

Docs say you can: http://www.linux-ha.org/ha.cf#node

> unless you have a specific reason to allow the two IP addresses to exist on two different machines you should put them both on the same line
>
> www1 192.168.1.3 192.168.1.4

OK.

>> I've added this to my /etc/sysctl.conf, which was apparently necessary to allow the floating IPs to exist:
>>
>> net.ipv4.ip_nonlocal_bind=1
>
> this isn't needed to let floating IPs exist, but it is needed to let software start up that wants to use these IP addresses when the box isn't active.

Ah, ok, I knew it was something like that.

heartbeat didn't re-read its config with a simple restart; it needed a full stop/start. After doing this, both floating IPs are up and it seems to be working!

Thanks for your help,

Marcus
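As a quick way to confirm the sysctl setting discussed above is actually present in the config file, here is a small Python sketch. The function name `sysctl_setting` is made up for illustration, and the parsing rules are deliberately simple (whole-line `#` or `;` comments, `key = value` pairs, last entry wins).

```python
def sysctl_setting(conf_text, key):
    """Return the last value assigned to `key` in sysctl.conf-style text.

    Returns None if the key is never set. Lines starting with '#' or ';'
    are treated as comments and skipped.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()  # a later entry overrides an earlier one
    return value


if __name__ == "__main__":
    example = "# allow binding to addresses not yet on this host\nnet.ipv4.ip_nonlocal_bind=1\n"
    print(sysctl_setting(example, "net.ipv4.ip_nonlocal_bind"))
```

This only inspects file contents. The value the kernel is using right now can be read from /proc/sys/net/ipv4/ip_nonlocal_bind.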
[Linux-HA] Basic 2-node floating IP setup
I'm trying to set up HA on an Ubuntu cluster. The docs are fairly comprehensive, but I still can't make sufficient sense of them or find any sufficiently matching examples. My scenario is pretty standard: two nodes running pound, with HA managing two floating IPs between the two. I'm using the stock Ubuntu 2.1.4 package with an HA 1.0 style config.

The main thing I can't quite figure out is exactly what to do with the floating IPs. As far as I can see, the floating IPs are purely resources to be shared by my nodes; as such they are not nodes themselves, and don't need to appear in ha.cf. Having said that, I have use of another HA config done by someone else that lists a floating IP as a node and ucast, and it works fine, which has me confused.

Say that my two nodes are 192.168.1.1 and 192.168.1.2, call them www1 and www2, and my floating IPs are 192.168.1.3 and 192.168.1.4. So far my ha.cf looks like this:

node www1 www2
ucast eth0 192.168.1.1
ucast eth0 192.168.1.2
deadtime 5
deadping 5
debug 0

and haresources looks like this:

www1 192.168.1.3
www1 192.168.1.4

I've added this to my /etc/sysctl.conf, which was apparently necessary to allow the floating IPs to exist:

net.ipv4.ip_nonlocal_bind=1

Does all that look right? Anything I've missed? Do the floating IPs need to appear in ha.cf as ucast lines (like my other setup)? Is my other setup wrong?

Thanks,

Marcus