Re: [Linux-ha-dev] [Openais] An OCF agent for LXC (Linux Containers)
On Tue, Apr 26, 2011 at 03:36:35PM +0200, Florian Haas wrote:
> Thanks Darren! Thanks for the contribution! Can I suggest:
>
> - we move this discussion to the linux-ha-dev list (where most OCF RA
>   related discussions and reviews take place);
> - you give the RA a makeover following the OCF RA developer's guide
>   (http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html);
> - you set up your own github fork of
>   https://github.com/ClusterLabs/resource-agents, and push your RA to it
>   so we can eventually pull it into the mainline repo?
>
> Also, can you explain what the advantages of your approach are, versus
> using libvirt-managed lxc containers which Pacemaker can tie into via
> the existing VirtualDomain agent?

Yes, this is the first thing I thought about too. A few remarks:

- the required attributes in the meta-data need to be reviewed: a
  parameter is either required or has a default, it cannot be both;
- why use screen(1) in start?

BTW, since lxc seems to be easy to set up, it would be great to supply an
ocft test file along with the RA. It's quite straightforward; just make a
copy of one of the existing test files from tools/ocft.

Cheers,
Dejan

> Thanks!
> Cheers,
> Florian
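To illustrate Dejan's point about required parameters versus defaults: in
OCF meta-data, a parameter that ships a default in its content element
should be advertised with required="0"; only a parameter with no sensible
default should carry required="1". A minimal sketch (the parameter name
and default path are invented for illustration, not taken from the LXC RA):

  <parameter name="config" unique="1" required="0">
    <longdesc lang="en">
    Path to the container configuration file.
    </longdesc>
    <shortdesc lang="en">Container config file</shortdesc>
    <content type="string" default="/etc/lxc/lxc.conf"/>
  </parameter>

If the agent cannot guess a value, drop the default attribute and set
required="1" instead; advertising both confuses tools that read the
meta-data.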
Re: [Linux-ha-dev] STONITH plugin for VMware vCenter
Hi,

On Thu, Apr 21, 2011 at 11:08:08AM +0200, Nhan Ngo Dinh wrote:
> Hi,
>
> On Tue, 2011-04-19 at 14:21 +0200, Dejan Muhamedagic wrote:
> > > longdesc lang=en The VMware vCenter address (default: localhost)
> >
> > The defaults should go into the content element (see other stonith
> > plugins, e.g. external/ipmi).
>
> These defaults come from the vSphere Perl SDK; they are not handled
> inside this code. Does it make any difference? Anyway, I've changed it
> as you said.
>
> > > Enable/disable a PowerOnVM on reset when the target virtual machine
> > > is off. Allowed values: 0, 1
> >
> > This should default to 1. For better or worse, that's what stonith
> > prescribes and other plugins adhere to.
>
> Ok. I've also added an error if RESETPOWERON is set and the machine is
> powered off.

OK.

> > Is this the only error which can happen? If not, then no error will be
> > logged in that case. Ditto for another occurrence below.
>
> This is what happens according to the SDK; however, I've also added a
> generic error handling procedure to die() if anything else fails.

Good. One (probably) never knows the future.

I'll push the plugin now to the public repository. We just need one more
thing fixed: the info commands such as getinfo-xml have to work without
the software which would otherwise be required for the plugin's
operation, in this case the VMware::VIRuntime module. I guess that you
need to use eval for that.

Many thanks for the contribution. Not least for the documentation!

Cheers,
Dejan

> Best regards,
> Nhan
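Dejan's last point — that getinfo-xml must keep working even when the
vSphere SDK is not installed — is usually handled by loading the module
lazily inside eval instead of with a top-level use statement. A rough
Perl sketch, with invented function names and a simplified action list
that may not match the actual plugin:

  # Load the vSphere SDK only when an action really needs it, so that
  # informational actions still work on hosts without the SDK.
  sub need_sdk {
      eval { require VMware::VIRuntime; };
      if ($@) {
          print STDERR "VMware::VIRuntime not found: $@\n";
          exit 1;
      }
  }

  my $action = shift @ARGV || '';
  if ($action eq 'getinfo-xml' || $action eq 'getconfignames') {
      print_metadata();          # hypothetical helper, no SDK required
  } elsif ($action =~ /^(gethosts|status|reset|on|off)$/) {
      need_sdk();                # SDK required from this point on
      # ... talk to vCenter ...
  }

The same pattern lets the plugin emit a clear error only when an SDK-using
action is actually requested.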
[Linux-ha-dev] Cluster Stack - Ubuntu Developer Summit
Greetings everyone!

A couple of weeks from today, the Ubuntu Developer Summit (UDS) will be
kicking off. UDS is the event where discussion happens about the next
Ubuntu release in its various aspects, such as Server, Desktop,
Foundations, etc. This time, UDS will be held in Budapest, Hungary,
between 9-13 May, for Ubuntu 11.10 Oneiric Ocelot, due in October 2011.

As has become customary from past UDSes, this time we will also have a
session for the Cluster Stack. The primary objective of this session is
to discuss the adoption of Pacemaker 1.1.X and related software as a
technology preview of what's yet to come, in preparation for the next
Ubuntu release, 12.04, which is a Long Term Support release, or to
discuss upgrade-path options for partially upgrading some components
while leaving others (i.e. fence-agents, resource-agents). Additionally,
we will also discuss other features we would like to see for the Cluster
Stack in Ubuntu, as well as how to spread its adoption, improve
documentation, etc.

UDSes are open-to-the-public events, and I believe it would be great if
upstream could participate and maybe further the discussion about the
Cluster Stack. For more information about UDS, please visit [1]. The
specific date/time for the Cluster Stack session is not yet available.
If you require any further information, please don't hesitate to contact
me.

[1]: http://uds.ubuntu.com/

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
Systems Engineer
Re: [Linux-ha-dev] Bug in crm shell or pengine
Hi,

On Tue, Apr 19, 2011 at 09:22:41AM -0600, Serge Dubrouski wrote:
> On Tue, Apr 19, 2011 at 1:12 AM, Andrew Beekhof and...@beekhof.net wrote:
> > On Mon, Apr 18, 2011 at 11:38 PM, Serge Dubrouski serge...@gmail.com wrote:
> > > Ok, I've read the documentation. It's not a bug, it's a feature :-)
> >
> > Might be nice if the shell could somehow prevent such configs, but it
> > would be non-trivial to implement.
>
> Or maybe it is as trivial as checking for such duplicates and, in the
> case of different roles, adjusting the interval by plus or minus one.

A good idea. Could you please file a bugzilla lest we forget about it.

Thanks,
Dejan

> > > On Mon, Apr 18, 2011 at 3:01 PM, Serge Dubrouski serge...@gmail.com wrote:
> > > > Hello -
> > > >
> > > > Looks like there is a bug in the crm shell (Pacemaker version 1.1.5)
> > > > or in pengine:
> > > >
> > > >   primitive pg_drbd ocf:linbit:drbd \
> > > >     params drbd_resource=drbd0 \
> > > >     op monitor interval=60s role=Master timeout=10s \
> > > >     op monitor interval=60s role=Slave timeout=10s
> > > >
> > > > Log file:
> > > >
> > > >   Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
> > > >   Apr 17 04:05:29 cs51 crmd: [5535]: info: do_state_transition: Starting PEngine Recheck Timer
> > > >   Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource
> > > >   Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
> > > >   Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource
> > > >   Apr 17 04:05:29 cs51 pengine: [5534]: ERROR: is_op_dup: Operation pg_drbd-monitor-60s-0 is a duplicate of pg_drbd-monitor-60s
> > > >
> > > > Plus strange behavior of the cluster, like the inability to move
> > > > resources from one node to another.
> > > >
> > > > --
> > > > Serge Dubrouski.
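The workaround discussed above — giving the Master and Slave monitor
operations different intervals so that each (name, interval) pair is
unique — looks like this in crm shell syntax (the 59s value is an
arbitrary illustration of the "plus or minus one" idea):

  primitive pg_drbd ocf:linbit:drbd \
      params drbd_resource=drbd0 \
      op monitor interval=59s role=Master timeout=10s \
      op monitor interval=60s role=Slave timeout=10s

With distinct intervals pengine no longer treats the two monitor
operations as duplicates, and the is_op_dup errors disappear.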
Re: [Linux-ha-dev] [Pacemaker] Resource Agents 1.0.4: HA LVM Patch
Hi,

On Tue, Apr 19, 2011 at 03:56:16PM +0200, Ulf wrote:
> Hi,
>
> I attached a patch to enhance the LVM agent with the capability to set
> a tag on the VG (set_hosttag = true). In conjunction with a volume_list
> filter, this can prevent a VG from being activated on multiple hosts.
> Unfortunately, active VGs will stay active in case of an unclean
> operation.

Can you please elaborate on the benefits this patch would bring? Is it
supposed to prevent a VG from being mounted on more than one node?

Looking at the code, it seems that on the start operation the existing
tag would be overwritten regardless.

Thanks,
Dejan

P.S. Moving the discussion to the proper mailing list.

> The tag is always the hostname. Some configuration hints can be found
> here: http://sources.redhat.com/cluster/wiki/LVMFailover
>
> Cheers,
> Ulf
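For readers unfamiliar with the tag-based exclusive activation the patch
refers to: in the scheme described on the LVMFailover wiki page, each
node's lvm.conf only permits activation of VGs tagged with its own
hostname, and the agent moves the tag on failover. A rough sketch with
placeholder VG and host names (not taken from Ulf's patch):

  # /etc/lvm/lvm.conf on node "alice": only local VGs and VGs tagged
  # with this host's name may be activated.
  #   volume_list = [ "rootvg", "@alice" ]

  # On the node that should own the shared VG, tag and activate it:
  vgchange --addtag alice vg_shared
  vgchange -a y vg_shared

  # On failover, the new owner replaces the tag with its own:
  vgchange --deltag alice vg_shared
  vgchange --addtag bob vg_shared
  vgchange -a y vg_shared

Because the volume_list filter is enforced by LVM itself, a node whose
hostname does not match the tag simply refuses to activate the VG —
which is the protection Ulf's set_hosttag option is meant to automate.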
Re: [Linux-ha-dev] Bug in crm shell or pengine
On Tue, Apr 26, 2011 at 1:03 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
> Hi,
>
> On Tue, Apr 19, 2011 at 09:22:41AM -0600, Serge Dubrouski wrote:
> > On Tue, Apr 19, 2011 at 1:12 AM, Andrew Beekhof and...@beekhof.net wrote:
> > > On Mon, Apr 18, 2011 at 11:38 PM, Serge Dubrouski serge...@gmail.com wrote:
> > > > Ok, I've read the documentation. It's not a bug, it's a feature :-)
> > >
> > > Might be nice if the shell could somehow prevent such configs, but it
> > > would be non-trivial to implement.
> >
> > Or maybe it is as trivial as checking for such duplicates and, in the
> > case of different roles, adjusting the interval by plus or minus one.
>
> A good idea. Could you please file a bugzilla lest we forget about it.

Bug 2586.

--
Serge Dubrouski.
Re: [Linux-HA] Interested in Contributing
On 4/23/2011 at 03:55 AM, Michael Thrift mike.thr...@schryvermedical.com wrote:
> All,
>
> I've recently started diving into Linux-HA, and I must say I am very
> impressed.

Welcome!

> I'm developing some in-house HA solutions, leveraging the Linux-HA
> project, and it's going very well. One of the projects I've been working
> on is Squid HA, and I found that the OCF script was a little too limited
> for our deployment of multiple Squid instances on the same box. I've
> modified the OCF to include a new OCF_RESKEY named squid_address. This
> allows the script to check the health of a specific Squid instance,
> rather than just checking whether Squid is running in general. I'd like
> to contribute this to the project, but I'm not sure of the best place to
> do so... Any thoughts on this? I'm happy to share my mods to the OCF
> script for those who are interested. Thanks!

Try the linux-ha-dev list for RA patches/tweaks/contributions/etc.

Regards,
Tim

--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.
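As an illustration of the kind of change Michael describes, a monitor
action bound to a specific listening address would probe that address
instead of only looking for a squid process. This is a generic sketch —
the squid_address/squid_port parameters and the use of squidclient are
assumptions, not necessarily what his actual patch does:

  # Inside the RA's monitor action (sketch):
  SQUID_ADDR="${OCF_RESKEY_squid_address:-127.0.0.1}"
  SQUID_PORT="${OCF_RESKEY_squid_port:-3128}"

  # Ask this specific Squid instance for its cache-manager info page;
  # any response means the instance is alive.
  if squidclient -h "$SQUID_ADDR" -p "$SQUID_PORT" mgr:info >/dev/null 2>&1; then
      return $OCF_SUCCESS
  fi
  return $OCF_NOT_RUNNING

Depending on the local cache-manager configuration, the mgr:info probe
may need a password or could be replaced by fetching any small URL
through the instance.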
[Linux-HA] [Heartbeat] my VIP doesn't work :(
Hi all,

First, I'm French, so sorry in advance for my English...

I have to use Heartbeat for high availability between 2 Tomcat 5.5
servers under Linux RedHat 5.3. The first server is active, the other one
is passive. The master is called servappli01, with IP address
186.20.100.40; the slave is called servappli02, with IP address
186.20.100.39. I configured a virtual IP, 186.20.100.41. Tomcat is not
launched when a server boots; it is Heartbeat that starts Tomcat when it
is running.

My problem is: when Heartbeat is started on the first server, then on the
second server, the VIP is assigned to both servers! Also, Tomcat is
started on each server, and each node sees the other node as dead!

Here is my configuration.

ha.cf file (the same on each server):

  logfile /var/log/ha-log
  debugfile /var/log/ha-debug
  logfacility none
  keepalive 2
  warntime 6
  deadtime 10
  initdead 90
  bcast eth0
  node servappli01 servappli02
  auto_failback yes
  respawn hacluster /usr/lib/heartbeat/ipfail
  apiauth ipfail gid=haclient uid=hacluster

haresources file (the same on each server):

  servappli01 IPaddr::186.20.100.41/24/eth0 tomcat

Result of the ifconfig command on the first server (servappli01):

  eth0   Link encap:Ethernet  HWaddr 00:1E:0B:BB:C2:38
         inet adr:186.20.100.40  Bcast:186.20.100.255  Masque:255.255.255.0
         adr inet6: fe80::21e:bff:febb:c238/64 Scope:Lien
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:14404996 errors:0 dropped:0 overruns:0 frame:0
         TX packets:6580505 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 lg file transmission:1000
         RX bytes:385833 (3.5 GiB)  TX bytes:2694953468 (2.5 GiB)
         Interruption:177 Memoire:fa00-fa012100

  eth0:0 Link encap:Ethernet  HWaddr 00:1E:0B:BB:C2:38
         inet adr:186.20.100.41  Bcast:186.20.100.255  Masque:255.255.255.0
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         Interruption:177 Memoire:fa00-fa012100

Result of the ifconfig command on the second server (servappli02) at the
same time:

  eth0   Link encap:Ethernet  HWaddr 00:1E:0B:77:C9:0C
         inet adr:186.20.100.39  Bcast:186.20.100.255  Masque:255.255.255.0
         adr inet6: fe80::21e:bff:fe77:c90c/64 Scope:Lien
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:23815049 errors:0 dropped:0 overruns:0 frame:0
         TX packets:17441845 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 lg file transmission:1000
         RX bytes:2620027933 (2.4 GiB)  TX bytes:3595896739 (3.3 GiB)
         Interruption:177 Memoire:fa00-fa012100

  eth0:0 Link encap:Ethernet  HWaddr 00:1E:0B:77:C9:0C
         inet adr:186.20.100.41  Bcast:186.20.100.255  Masque:255.255.255.0
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         Interruption:177 Memoire:fa00-fa012100

Result of the /usr/bin/cl_status listnodes command (on each server):

  servappli02
  servappli01

Result of /usr/bin/cl_status nodestatus servappli01 on servappli01: active
Result of /usr/bin/cl_status nodestatus servappli02 on servappli01: dead
Result of /usr/bin/cl_status nodestatus servappli01 on servappli02: dead
Result of /usr/bin/cl_status nodestatus servappli02 on servappli02: active

And of course, if I kill Tomcat on the master server, there is no switch
to the second server (a call to a webapp using the VIP doesn't work).

Can somebody help me please? I guess there is something wrong but I don't
know what!

Thanx
Mathieu
[Linux-HA] mgmtd: [xxx]: ERROR: on_listen attach server socket failed
Hi,

We are getting "mgmtd: [xxx]: ERROR: on_listen attach server socket
failed" errors in the logs from time to time. Any idea what this means
and what the cause is? The cluster looks OK, though.

Thanks,
[Linux-HA] XEN NPIV with Brocade bfa driver anyone?
Hi!

I just found out that XEN4's NPIV (Fibre Channel N_Port ID
Virtualization) does not work with Brocade's bfa driver in SLES 11 SP1.
That is because of non-standard sysfs entries being used for virtual
ports (similar to Emulex, but still different). I wonder whether anybody
has hacked block-npiv-common to make that work. Sorry if that's not very
closely related to HA, but when wanting to move VMs, virtual ports are
quite nice to have...

Regards,
Ulrich
Re: [Linux-HA] [Heartbeat] my VIP doesn't work :(
On 11-04-22 06:25 AM, SEILLIER Mathieu wrote:
> [original message quoted in full; see above]

It almost sounds like the nodes are unaware of each other. Could be a
network thing, maybe. Here are some things to try:

- Can you ssh or ping one node from the other?
- Bring up one node with the VIP running, and leave the other node up but
  with heartbeat down. Can you ping the VIP from the node NOT running HA?
- What happens when you look at the cluster when both nodes are running?
  Use the crm_mon command and paste what you see in here.

I'm thinking you have some sort of network issue.
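A concrete way to check the first point — whether the heartbeat broadcast
packets actually reach the peer — is to watch UDP port 694 (heartbeat's
default port) on both nodes while heartbeat is running. The commands
below are a generic sketch, not something from the original thread:

  # Basic reachability first:
  ping -c 3 186.20.100.40      # run from servappli02
  ping -c 3 186.20.100.39      # run from servappli01

  # Then check that the bcast heartbeats on eth0 arrive at the peer
  # (heartbeat uses UDP port 694 by default):
  tcpdump -ni eth0 udp port 694

  # If packets leave one node but never show up on the other, look at
  # switch settings (broadcast filtering) and at local firewall rules:
  iptables -L -n -v

If each node only ever sees its own heartbeats in tcpdump, the split-brain
behaviour described above (both nodes active, both holding the VIP) is
exactly what you would expect.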
Re: [Linux-HA] [Heartbeat] my VIP doesn't work :(
Have you generated the authkey with the corosync-keygen command on one
node and then copied that file to the other node?

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of mike
Sent: Tuesday, April 26, 2011 5:41 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] [Heartbeat] my VIP doesn't work :(

> [mike's reply, quoting the original message in full; see above]
Re: [Linux-HA] [Heartbeat] my VIP doesn't work :(
On 4/22/2011 4:25 AM, SEILLIER Mathieu wrote:
> Result of /usr/bin/cl_status nodestatus servappli01 on servappli01: active
> Result of /usr/bin/cl_status nodestatus servappli02 on servappli01: dead
> Result of /usr/bin/cl_status nodestatus servappli01 on servappli02: dead
> Result of /usr/bin/cl_status nodestatus servappli02 on servappli02: active

iptables?

> And of course, if I kill Tomcat on master server, there's no switch to
> the second server (a call to a webapp using the VIP doesn't work).

You need mon for that.

Dima
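Two notes on this terse reply, for readers following along. First, the
iptables hint can be tested directly: if a host firewall drops the
heartbeat traffic, each node will see the other as dead exactly as shown
above. The rules below are generic examples, not taken from the thread:

  # Allow heartbeat's UDP traffic (port 694) from the peer:
  iptables -I INPUT -p udp --dport 694 -s 186.20.100.39 -j ACCEPT   # on servappli01
  iptables -I INPUT -p udp --dport 694 -s 186.20.100.40 -j ACCEPT   # on servappli02

  # Or, purely as a test, drop the firewall temporarily:
  service iptables stop

Second, "you need mon for that" refers to the fact that heartbeat in
haresources (v1) mode only reacts to node failure; it does not monitor
the health of a resource such as Tomcat. Reacting to a dead Tomcat
requires an external monitor (e.g. mon) or a CRM-based configuration
with monitor operations.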
Re: [Linux-HA] [Heartbeat] my VIP doesn't work :(
On Tue, Apr 26, 2011 at 12:29:28PM +0000, Amit Jathar wrote:
> Have you generated the authkey with the corosync-keygen command on one
> node and then copied that file to the other node?

Heartbeat != Corosync

Thanks,
Dejan

> [the rest of the thread quoted in full; see above]
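To expand on Dejan's terse point: this cluster runs heartbeat, so its
authentication key lives in /etc/ha.d/authkeys rather than in a corosync
keyfile, and corosync-keygen does not apply. A typical generic setup
(the shared secret is a placeholder) looks like this, with the file kept
identical on both nodes:

  # /etc/ha.d/authkeys  (must be mode 0600 and identical on both nodes)
  auth 1
  1 sha1 SomeSharedSecret

  chmod 600 /etc/ha.d/authkeys

A mismatched or missing authkeys file produces authentication errors in
ha-log rather than the silent split-brain described in this thread, so it
is worth checking but is probably not the cause here.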
Re: [Linux-HA] JAVA sun.jnu.encoding ignored when process started from BP. Not when started manually
Hi,

On Thu, Apr 21, 2011 at 05:45:37PM -0500, Mike Toler wrote:
> I have a java process that is started by Linux-HA. I have created an OCF
> script called BillingProcessor. That script calls an outside script
> (pm.pl) which starts the process.
>
> The java command is shown here; note that I am including the
> -Dsun.jnu.encoding=UTF-8 directive:
>
>   java -Dsun.jnu.encoding=UTF-8 \
>     -cp ../lib/RSCBillingProcessor.jar:../lib/RSCBillingCollector.jar:../lib/fw_alarms.jar:../lib/fw_app.jar:../lib/fw_base.jar:../lib/fw_comm.jar:../lib/fw_config.jar:../lib/fw_dom4j.jar:../lib/fw_file.jar:../lib/fw_jdom.jar:../lib/fw_metaobject.jar:../lib/fw_staged.jar:../lib/fw_stats.jar:../lib/fw_util.jar:../lib/fw_xmpp.jar:../lib/3rdPartyLib/log4j.jar:../lib/3rdPartyLib/jdom.jar:../lib/3rdPartyLib/jcore.jar:../lib/3rdPartyLib/commons-cli-1.0.jar:../lib/3rdPartyLib/snmp.jar:../config \
>     com.prodeasystems.rsc.bc.processor.BillingProcApp BP ../config/BP.xml
>
> When I start the process using my script alone, I see:
>
>   sun.jnu.encoding = UTF-8
>
> When it is started from Heartbeat, I see:
>
>   sun.jnu.encoding = ANSI_X3.4-1968

See how? In the java process? Or with ps(1)? The latter would be really
strange. Otherwise, I cannot say what's going on. You can try to debug
the script: just add 'set -x' somewhere and the trace will be logged.
Perhaps also dump the environment just before invoking java. Redirect to
stderr (fd 2); stdout is logged only with debug on. Or do an 'exec 2>...'
redirection at the top of the script.

Thanks,
Dejan

> I can't for the life of me figure out HOW heartbeat can be causing this,
> but it is 100% consistent over 4 installations on 3 OSes (RedHat 5.4,
> CentOS 5.4 and CentOS 6.0). The process started from the command line
> has an encoding of UTF-8. The process started from heartbeat has
> ANSI_X3.4-1968. Has anyone ever seen anything like this?
>
> Michael Toler
> Senior Systems Integration Engineer
> Prodea Systems
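Dejan's debugging suggestion, spelled out as a generic sketch (file paths
and locale values are illustrative): trace the wrapper script and dump its
environment right before java starts. One likely culprit worth checking
once the environment is visible — this is an assumption, not something
confirmed in the thread — is that processes spawned by heartbeat inherit a
minimal environment without LANG/LC_ALL, so the JVM falls back to the
POSIX locale, which Java reports as ANSI_X3.4-1968.

  #!/bin/sh
  # Top of the RA / wrapper script: send a full shell trace to a file.
  exec 2>>/tmp/billingprocessor-trace.log
  set -x

  # Just before starting java, record the environment the JVM will see:
  env | sort >&2

  # If LANG/LC_ALL turn out to be unset under heartbeat, setting them
  # explicitly here is a common fix (value is only an example):
  LANG=en_US.UTF-8
  LC_ALL=en_US.UTF-8
  export LANG LC_ALL

  # ... then invoke java exactly as in the original command above.

Comparing the env dump from a manual start against the one from a
heartbeat start should show exactly which variables differ.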
Re: [Linux-HA] pacemaker not reconnecting
Hi,

On Thu, Apr 21, 2011 at 04:52:23PM +0200, Jean-Baptiste GIRARD wrote:
> Hi,
>
> I have a two-node cluster with pacemaker (heartbeat). Regularly, after a
> cluster partition there is a problem with the membership. Both nodes see
> the other one as offline, and you can see the following log:
>
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: CRIT: Cluster node acsdupli-s returning after partition.
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: WARN: Deadtime value may be too small.
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: info: See FAQ for information on tuning deadtime.
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: info: Link acsdupli-s:eth0 up.
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: WARN: Late heartbeat: Node acsdupli-s: interval 75040 ms
> Apr 20 14:13:18 ACSDUPLI-M heartbeat: [13151]: info: Status update for node acsdupli-s: status active
> Apr 20 14:13:18 ACSDUPLI-M cib: [13168]: WARN: cib_peer_callback: Discarding cib_apply_diff message (2ce2b) from acsdupli-s: not in our membership
> Apr 20 14:13:18 ACSDUPLI-M pingd: [13173]: notice: pingd_lstatus_callback: Status update: Ping node acsdupli-s now has status [up]
> Apr 20 14:13:18 ACSDUPLI-M pingd: [13173]: notice: pingd_nstatus_callback: Status update: Ping node acsdupli-s now has status [up]
> Apr 20 14:13:18 ACSDUPLI-M pingd: [13173]: notice: pingd_nstatus_callback: Status update: Ping node acsdupli-s now has status [active]
> Apr 20 14:13:18 ACSDUPLI-M crmd: [13172]: notice: crmd_ha_status_callback: Status update: Node acsdupli-s now has status [active] (DC=true)
> Apr 20 14:13:18 ACSDUPLI-M crmd: [13172]: info: crm_update_peer_proc: acsdupli-s.ais is now online
> Apr 20 14:13:19 ACSDUPLI-M ccm: [13167]: info: Break tie for 2 nodes cluster
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: no mbr_track info
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: instance=91, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: mem_handle_event: no mbr_track info
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=91)
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: mem_handle_event: instance=91, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=91)
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: ccm_event_detail: NEW MEMBERSHIP: trans=91, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: ccm_event_detail: CURRENT: acsdupli-m [nodeid=0, born=91]
> Apr 20 14:13:19 ACSDUPLI-M crmd: [13172]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
> Apr 20 14:13:19 ACSDUPLI-M ccm: [13167]: info: Break tie for 2 nodes cluster
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: no mbr_track info
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: mem_handle_event: instance=92, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Apr 20 14:13:19 ACSDUPLI-M cib: [13168]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=92)
> Apr 20 14:13:20 ACSDUPLI-M cib: [13168]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/1001, version=0.185.2): ok (rc=0)
> Apr 20 14:13:20 ACSDUPLI-M ccm: [13167]: info: Break tie for 2 nodes cluster
> Apr 20 14:13:20 ACSDUPLI-M cib: [13168]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Apr 20 14:13:20 ACSDUPLI-M cib: [13168]: info: mem_handle_event: no mbr_track info
> Apr 20 14:13:20 ACSDUPLI-M cib: [13168]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Apr 20 14:13:20 ACSDUPLI-M cib: [13168]: info: mem_handle_event: instance=93, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Apr 20 14:13:20 ACSDUPLI-M cib: [13168]: info: cib_ccm_msg_callback: Processing CCM event=NEW
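The "Deadtime value may be too small" and "Late heartbeat: ... interval
75040 ms" warnings in this log indicate that heartbeats were delayed far
beyond the configured deadtime (here, 75 seconds late), typically because
of load or network trouble, so each node declared the other dead before
the partition healed. This does not by itself explain why membership is
not re-forming afterwards, but one common mitigation for the late-heartbeat
warnings is raising the timers in ha.cf on both nodes. The values below
are examples only:

  # /etc/ha.d/ha.cf (excerpt, values are illustrative)
  keepalive 2        # send a heartbeat every 2 seconds
  warntime 20        # warn when a heartbeat is this late
  deadtime 60        # declare the peer dead only after 60 seconds
  initdead 120       # allow extra time right after boot

If the delays are caused by sustained load on the nodes, the FAQ link in
the log (Heavy_Load) is the place to start.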
Re: [Linux-HA] Pacemaker installation errors
Hi,

On Thu, Apr 21, 2011 at 11:28:26AM +0000, Amit Jathar wrote:
> Hi,
>
> I tried to install Pacemaker. While installing the resource agents, I
> ran make and got the attached errors. I tried twice (doing make clean as
> well), and on both occasions the error was a bit different (as
> attached). The steps I was performing were:
>
>   wget -O resource-agents.tar.bz2 http://hg.linux-ha.org/agents/archive/tip.tar.bz2
>   tar jxvf resource-agents.tar.bz2
>   cd Cluster-Resource-Agents-*
>   ./autogen.sh
>   ./configure --prefix=$PREFIX
>   make
>   sudo make install
>
> I am using CentOS 5.6 64-bit.

Try rpmbuild? Or just 'make rpm'?

Thanks,
Dejan

> Or can I use Pacemaker with this erred source?
>
> Thanks,
> Amit

> Content-Description: ResAgent_make_error1.txt
>
> Note: Writing ocf_heartbeat_ClusterMon.7
> OCF_ROOT=. OCF_FUNCTIONS_DIR=../heartbeat ../heartbeat/CTDB meta-data > metadata-CTDB.xml
> /usr/bin/xsltproc --novalid \
>   --stringparam package resource-agents \
>   --stringparam version 1.0.4 \
>   --output ocf_heartbeat_CTDB.xml \
>   ra2refentry.xsl metadata-CTDB.xml
> /usr/bin/xsltproc \
>   --xinclude \
>   http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl ocf_heartbeat_CTDB.xml
> Note: Writing ocf_heartbeat_CTDB.7
> OCF_ROOT=. OCF_FUNCTIONS_DIR=../heartbeat ../heartbeat/Delay meta-data > metadata-Delay.xml
> /usr/bin/xsltproc --novalid \
>   --stringparam package resource-agents \
>   --stringparam version 1.0.4 \
>   --output ocf_heartbeat_Delay.xml \
>   ra2refentry.xsl metadata-Delay.xml
> /usr/bin/xsltproc \
>   --xinclude \
>   http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl ocf_heartbeat_Delay.xml
> Note: Writing ocf_heartbeat_Delay.7
> OCF_ROOT=. OCF_FUNCTIONS_DIR=../heartbeat ../heartbeat/Dummy meta-data > metadata-Dummy.xml
> /usr/bin/xsltproc --novalid \
>   --stringparam package resource-agents \
>   --stringparam version 1.0.4 \
>   --output ocf_heartbeat_Dummy.xml \
>   ra2refentry.xsl metadata-Dummy.xml
> /usr/bin/xsltproc \
>   --xinclude \
>   http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl ocf_heartbeat_Dummy.xml
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: '%' forbidden except for entities references
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: '%' forbidden except for entities references
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: '%' forbidden except for entities references
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: '%' forbidden except for entities references
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: '%' forbidden except for entities references
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: '%' forbidden except for entities references
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: " or ' expected
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : xmlParseEntityDecl: entity list.class not terminated
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : XML conditional section not closed
> ^
> unable to parse ocf_heartbeat_Dummy.xml
> gmake[1]: *** [ocf_heartbeat_Dummy.7] Error 6
> rm metadata-CTDB.xml metadata-Delay.xml metadata-Dummy.xml metadata-ClusterMon.xml metadata-AudibleAlarm.xml
> gmake[1]: Leaving directory `/usr/local/src/Cluster-Resource-Agents-7a11934b142d/doc'
> make: *** [all-recursive] Error 1

> Content-Description: ResAgent_make_error2.txt
>
> gmake[1]: Entering directory `/usr/local/src/Cluster-Resource-Agents-7a11934b142d/doc'
> OCF_ROOT=. OCF_FUNCTIONS_DIR=../heartbeat ../heartbeat/AoEtarget meta-data > metadata-AoEtarget.xml
> /usr/bin/xsltproc --novalid \
>   --stringparam package resource-agents \
>   --stringparam version 1.0.4 \
>   --output ocf_heartbeat_AoEtarget.xml \
>   ra2refentry.xsl metadata-AoEtarget.xml
> /usr/bin/xsltproc \
>   --xinclude \
>   http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl ocf_heartbeat_AoEtarget.xml
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue: '%' forbidden except for entities references
> ^
> http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd:171: parser error : EntityValue:
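Both attachments fail in the same place: xsltproc is trying to fetch the
DocBook DTD and stylesheets from the network (the oasis-open.org and
docbook.sourceforge.net URLs) and getting back something it cannot parse,
which breaks only the documentation build, not the agents themselves.
Besides Dejan's 'make rpm' suggestion, installing the local DocBook
catalogs usually lets the doc build resolve those URLs offline. A sketch
for CentOS 5 — package names may vary slightly by distribution:

  # As root: install the local DocBook DTDs and XSL stylesheets so
  # xsltproc resolves them through the XML catalog instead of the network.
  yum install docbook-dtds docbook-style-xsl libxml2 libxslt

  # Then rebuild the agents:
  cd Cluster-Resource-Agents-*
  make clean && make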
Re: [Linux-HA] UDP / DHCP / LDIRECTORD
Hi Simon,

Is there any way we can coerce you into offering us some assistance with
this again? I'm sure you are a very busy person, but any help you can
offer would be appreciated; also, if you need any further information
from me, that would be greatly appreciated.

Brian Carpio
Senior Systems Engineer
Office: +1.303.962.7242
Mobile: +1.720.319.8617
Email: bcar...@broadhop.com

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Brian Carpio
Sent: Monday, April 25, 2011 4:30 AM
To: General Linux-HA mailing list; 'Simon Horman'
Cc: 'lvs-devel'; 'Julian Anastasov'
Subject: Re: [Linux-HA] UDP / DHCP / LDIRECTORD

Hi,

It looks like there also might be a memory leak in this patch. Over the
last few months we have seen memory grow slowly, but lately the traffic
has increased and the memory utilization of the Linux box is now growing
faster. I put in a few scripts to try to detect where this memory leak
was coming from, and when watching /proc/meminfo over the last few days I
saw that slab was growing. So I put in a new script to watch slabtop, and
I can see that ip_vs_conn is growing. The number of SLABS just grows and
grows, and so does the CACHE_SIZE.

Is there any way you have a chance to look into this for us? Any
additional information I can give you about this problem?

Thanks a lot,
Brian Carpio

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Brian Carpio
Sent: Friday, February 25, 2011 12:14 PM
To: General Linux-HA mailing list; 'Simon Horman'
Cc: 'lvs-devel'; 'Julian Anastasov'
Subject: Re: [Linux-HA] UDP / DHCP / LDIRECTORD

Apparently this is related to some sort of race condition (possibly a
problem with my ldirectord start script, which edits the ipvsadm config
after ldirectord has started). If ldirectord starts to receive traffic on
port 67/68 before the following commands are run:

  ipvsadm -E -u 10.10.10.10:67 -o -s rr
  ipvsadm -E -u 10.10.10.10:68 -o -s rr

then it will be stuck sending traffic to the first server in the list.

Brian Carpio
Senior Systems Engineer
Office: +1.303.962.7242
Mobile: +1.720.319.8617
Email: bcar...@broadhop.com

-----Original Message-----
From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Brian Carpio
Sent: Thursday, February 24, 2011 3:47 PM
To: 'Simon Horman'
Cc: 'lvs-devel'; 'Julian Anastasov'; 'linux-ha@lists.linux-ha.org'
Subject: Re: [Linux-HA] UDP / DHCP / LDIRECTORD

All,

So this patch has been working for us flawlessly for the last 5 months or
so. Our infrastructure is 100% virtualized. The other day our
loadbalancer01 had a memory leak and crashed; since we use ldirectord
with heartbeat, loadbalancer02 took over. However, ever since then it
seems like the single-packet UDP scheduling has stopped working. Even if
I fail back over to the loadbalancer01 VM, I still see all the DHCP
traffic going to only one backend server.

If I run "ipvsadm -L -n" I can see that ipvsadm thinks both of the
backend servers are up, since the weight is set to 1 for each server. If
I reboot the second backend server (the one which is not receiving any
traffic) and then run "ipvsadm -L -n", I can see its weight go to 0, and
in the ldirectord log I can see that it is marked dead.

I have exported one of the load balancers and one of the backend servers
(using VMware) and imported them into another ESXi server; once I boot up
the loadbalancer it works perfectly...

I'm very stumped why this would happen. Is there any additional logging
you can think of that I might want to enable to see where the exact
problem is?

Here are my configs:

/etc/ha.d/ldirectord.conf

  checktimeout=10
  checkinterval=2
  autoreload=yes
  logfile=/var/log/ldirectord.log
  quiescent=yes

  virtual=10.10.10.10:67
      real=backend_server01:67 masq
      real=backend_server02:67 masq
      protocol=udp
      checktype=ping
      scheduler=rr

  virtual=10.10.10.10:68
      real=back_endserver01:68 masq
      real=backend_server02:68 masq
      protocol=udp
      checktype=ping
      scheduler=rr

I had to rewrite the ldirectord start script and added the following
lines in the start and restart sections:

  ipvsadm -E -u 10.10.10.10:67 -o -s rr
  ipvsadm -E -u 10.10.10.10:68 -o -s rr

Here is the output of "ipvsadm -L -n" when both backend servers are up
(working environment):

  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
  UDP  10.10.10.10:67 rr ops
    -> backend_server01:67       Masq    1      0          16731
    -> backend_server02:67       Masq    1      0          17447
  UDP  192.168.181.67:68 rr ops
    -> backend_server01:68       Masq    1      0          0
    -> backend_server02:68       Masq    1      0
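For the ip_vs_conn slab growth Brian describes, a couple of generic
observation commands (not taken from the thread) can help quantify the
problem and correlate it with the IPVS connection table. With one-packet
scheduling in effect (the '-o' option, shown as the 'ops' flag above),
connection entries should be very short-lived, so a count that only ever
grows is suspicious:

  # Watch the ip_vs_conn slab cache over time:
  watch -n 60 'grep ip_vs_conn /proc/slabinfo'

  # Count the entries IPVS itself reports, and check the UDP timeout:
  wc -l /proc/net/ip_vs_conn
  ipvsadm -L --timeout

If /proc/net/ip_vs_conn stays small while the slab keeps growing, that
points at objects not being freed rather than at a growing connection
table, which is useful detail to include when reporting the leak upstream.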
Re: [Linux-HA] Problem using Stonith external/ipmi device
On Tue, Apr 19, 2011 at 02:46:06PM +0200, Andrew Beekhof wrote:
> On Tue, Apr 19, 2011 at 12:43 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:
> > Hi,
> >
> > On Mon, Apr 11, 2011 at 09:41:12AM +0200, Andrew Beekhof wrote:
> > > On Fri, Apr 8, 2011 at 11:07 AM, Matthew Richardson m.richard...@ed.ac.uk wrote:
> > > > On 07/04/11 16:36, Dejan Muhamedagic wrote:
> > > > > For whatever reason stonith-ng doesn't think that stonithipmidisk1
> > > > > can manage this node. Which version of Pacemaker do you run?
> > > > > Perhaps this has been fixed in the meantime. I cannot recall right
> > > > > now if there has been such a problem, but it's possible. You can
> > > > > also try to turn debug on and see if there are more clues.
> > > >
> > > > I'm using Pacemaker 1.1.5 from the clusterlabs rpm-next repositories
> > > > on el5. I've tried turning on debug, but there's no more information
> > > > coming out in the logs.
> > >
> > > man stonithd has the bits you need. Start with pcmk_host_check.
> >
> > That defaults to dynamic-list, which should query the resource. Right?
>
> Right.
>
> > Apparently, something's not quite ok there.
>
> The list command doesn't work, perhaps?

Yes, it does work. And it's been working since forever, as you know.
Unless there's something wrong with the installation. Whatever happened
here? Matthew?

Thanks,
Dejan

> > BTW, I've been doing tests with external/ssh and it did work fine.
>
> Also fine with fence_xvm.
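When pcmk_host_check=dynamic-list is in effect, stonith-ng builds the host
list by asking the device which hosts it can manage, so it can be useful
to run that query by hand with the stonith(8) tool from cluster-glue. A
generic sketch — the parameter values below are placeholders, not
Matthew's actual configuration:

  # Ask the external/ipmi plugin which hosts it claims to manage (-l),
  # and report device status (-S):
  stonith -t external/ipmi -S -l \
      hostname=node1 ipaddr=192.168.1.100 userid=admin passwd=secret interface=lan

  # If the list comes back empty or wrong, stonith-ng will conclude the
  # device cannot manage the node. As a workaround, the resource can be
  # pinned with pcmk_host_list and pcmk_host_check=static-list.

Comparing the hand-run list output with what stonith-ng logs (with debug
on) usually shows quickly whether the plugin, its parameters, or the
stonith-ng side is at fault.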