Re: [Linux-HA] fence_apc always fails after some time and resources remains stopped
On Fri, 22 Nov 2013 10:26:08 CET, RaSca wrote: [...] After this, the resources remain in a stopped state. Why does this happen? Am I hitting this case: https://github.com/ClusterLabs/pacemaker/pull/334 ? What kind of workaround can I use? Thanks a lot, as usual.

I don't know whether my problem is the one described in the pull request above, but what resolved it for me was using fence_apc_snmp instead of fence_apc. Since it uses SNMP it is faster and never times out. All the other questions are still open, so if you have a suggestion I am open to discussion.

-- RaSca
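A minimal sketch of the SNMP-based variant of the fencing primitive, carrying over the values from the original post in this thread (the community parameter and its value are assumptions; check which options your fence_apc_snmp version accepts):

primitive st_fence_scv1 stonith:fence_apc_snmp \
    params ipaddr=APCADDR login=USER passwd=PASS action=reboot \
        community=private \
        pcmk_host_check=static-list pcmk_host_list=scv1 port=1 \
    op monitor interval=60s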
[Linux-HA] fence_apc always fails after some time and resources remains stopped
Hi there, I'm using Pacemaker 1.1.10 on a two-node Debian cluster. The nodes are connected to an APC power switch, which I can contact from the command line like this:

# fence_apc -a APCADDR -x -l USER -p PASS -n 1 -o status
Status: ON

and for which I've configured two fencing resources like this:

primitive st_fence_scv1 stonith:fence_apc \
    params ipaddr=APCADDR login=USER passwd=PASS action=reboot verbose=true \
        pcmk_host_check=static-list pcmk_host_list=scv1 secure=true port=1 \
    op monitor interval=60s

After a clean start everything works fine. The problem is that, ALWAYS, after about an hour the monitor operation of the resource fails. In the logs I see this:

Nov 21 21:46:12 [2661] scv1 stonith-ng: info: stonith_command: Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:12 [2661] scv1 stonith-ng: info: stonith_action_create: Initiating action monitor for agent fence_apc (target=(null))

So the monitor is launched, and then:

Nov 21 21:46:32 [2661] scv1 stonith-ng: info: st_child_term: Child 20854 timed out, sending SIGTERM
Nov 21 21:46:32 [2661] scv1 stonith-ng: notice: stonith_action_async_done: Child process 20854 performing action 'monitor' timed out with signal 15
Nov 21 21:46:32 [2661] scv1 stonith-ng: notice: log_operation: Operation 'monitor' [20854] for device 'st_fence_scv2' returned: -62 (Timer expired)
Nov 21 21:46:32 [2665] scv1 crmd: error: process_lrm_event: LRM operation st_fence_scv2_monitor_6 (464) Timed Out (timeout=2ms)

So there is a timeout (which is plausible, since those APC devices are very slow). After this the device is stopped:

Nov 21 21:46:42 [2662] scv1 lrmd: info: log_execute: executing - rsc:st_fence_scv2 action:stop call_id:469
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command: Processed st_device_remove from lrmd.2662: OK (0)

And then restarted:

Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_action_create: Initiating action metadata for agent fence_apc (target=(null))
Nov 21 21:46:42 [2661] scv1 stonith-ng: notice: stonith_device_register: Device 'st_fence_scv2' already existed in device list (1 active devices)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command: Processed st_device_register from lrmd.2662: OK (0)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command: Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_action_create: Initiating action monitor for agent fence_apc (target=(null))

The first thing I find strange is the "already existed in device list" message, but in any case, after this the monitor fails again:

Nov 21 21:47:02 [2661] scv1 stonith-ng: info: st_child_term: Child 21265 timed out, sending SIGTERM
Nov 21 21:47:02 [2661] scv1 stonith-ng: notice: stonith_action_async_done: Child process 21265 performing action 'monitor' timed out with signal 15
Nov 21 21:47:02 [2661] scv1 stonith-ng: notice: log_operation: Operation 'monitor' [21265] for device 'st_fence_scv2' returned: -62 (Timer expired)
...
Nov 21 21:47:03 [2661] scv1 stonith-ng: info: stonith_command: Processed st_device_remove from lrmd.2662: OK (0)

After this, the resources remain in a stopped state. Why does this happen? Am I hitting this case: https://github.com/ClusterLabs/pacemaker/pull/334 ? What kind of workaround can I use? Thanks a lot, as usual.

-- RaSca
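Since the failures are plain monitor timeouts against a slow device, another workaround worth trying is simply to give the monitor more time. A sketch, not verified on this setup (pcmk_monitor_timeout is a Pacemaker 1.1.x fencing-device attribute; the 120s value is an arbitrary example):

primitive st_fence_scv1 stonith:fence_apc \
    params ipaddr=APCADDR login=USER passwd=PASS action=reboot verbose=true \
        pcmk_host_check=static-list pcmk_host_list=scv1 secure=true port=1 \
        pcmk_monitor_timeout=120s \
    op monitor interval=60s timeout=120s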
Re: [Linux-HA] Many location on ping resources and best practice for connectivity monitoring
On Fri, 09 Aug 2013 04:42:28 CEST, Andrew Beekhof wrote: [...] That sounds like something playing with the virt bridge when the vm starts. Is the host trying to ping through the bridge too?

Yes. Is this not correct? Many location constraints can reference the attribute created by a single ping resource.

It's still not clear to me if you have one ping resource or one ping resource per vm... don't do the second one.

I've got many location constraints based on the same cloned resource (which is named ping).

-- RaSca
Re: [Linux-HA] Many location on ping resources and best practice for connectivity monitoring
On Thu, 08 Aug 2013 01:07:06 CEST, Andrew Beekhof wrote: On 08/08/2013, at 12:37 AM, RaSca ra...@miamammausalinux.org wrote: [...] The problem I got is that when I clone a VM (using virt-clone) everything works fine until I try to add a new ping check. Can you describe more precisely what you mean by this?

Of course. The steps for adding a new virtual machine are:
- put the original resource in unmanaged mode;
- clone the original resource via virt-clone;
- add a primitive for the new vm;
- add an order/colocation constraint over the storage;
- add a location constraint based upon the ping attribute, like this one:

location loc_res_VirtualDomain_vm_connectivity res_VirtualDomain_vm \
    rule -inf: not_defined ping or ping lte 0

At this point something breaks. The ping resource on the node where the vm will be placed fails, making all the resources on it migrate.

1) Are there limitations on how many ping-based locations can be declared? Well, there is a finite number of hosts that can be pinged within a given interval. Is your timeout too short, perhaps? Are you using fping, which works in parallel?

I'm not using fping (maybe this could be a solution) and the timeout of the ping resource is 20, which makes sense to me.

2) Is this (one vm = one ping location) the best practice for monitoring the connectivity of the nodes? ping resources were intended to check if a cluster node could reach the outside world. You're using them to check if a VM resource is alive? Perhaps David's remote-node stuff would be better suited.

I'm using them to check whether a resource (the vm) is on a node which can reach the outside world. So one vm = one location. Is there a way to set a location on an entire node, so that if it loses the outside world all the resources on it are migrated? I was convinced that this kind of location had to be set up on each single resource. Thanks,

-- RaSca
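For reference, a sketch of the single-clone layout discussed here, with fping enabled (host addresses are placeholders, and use_fping is only available in versions of ocf:pacemaker:ping that ship that parameter):

primitive res_ping_connections ocf:pacemaker:ping \
    params name=ping host_list="192.0.2.1 192.0.2.2 192.0.2.3" \
        multiplier=1000 dampen=30s use_fping=1 \
    op monitor interval=10s timeout=60s
clone cl_ping_connections res_ping_connections \
    meta interleave=true

Each VM then gets its own location constraint referencing the same ping attribute, exactly as in the rule quoted above.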
Re: [Linux-HA] Antw: Many location on ping resources and best practice for connectivity monitoring
On Thu, 08 Aug 2013 08:29:09 CEST, Ulrich Windl wrote: Hi! I don't know whether this helps, but in a different configuration we saw monitor timeouts for IPaddr2 when there was high I/O load. Meanwhile we have upgraded all the software, but we had disabled most monitors for IPaddr2. Regards, Ulrich

Hi Ulrich, thanks for your answer. I can confirm that ping fails when we have high load. How did you manage the monitors on each node? Did you just disable them, or did you use some other workaround? Thanks,

-- RaSca
[Linux-HA] Many location on ping resources and best practice for connectivity monitoring
Hi all, I have a big Pacemaker (1.1.9-1512) cluster with 9 nodes and almost 200 virtual machines (all on the same underlying storage). Everything is based upon KVM and libvirt. Each VM has a location constraint based upon a ping resource, cloned on each node, that pings three hosts on the net. The problem is that when I clone a VM (using virt-clone) everything works fine until I try to add a new ping check. At that point, for some reason, the node's ping resource fails, with errors like this:

Jul 30 15:34:58 kvm09 lrmd[23467]: warning: child_timeout_callback: res_ping_connections_monitor_5000 process (PID 26406) timed out

We're investigating potential network problems (obviously the network people say those are impossible, but when the problem happens there are sometimes high ping latencies on the node). What I find very strange is that things break ONLY when I add a location constraint based upon ping, not, for example, when I add the storage order and colocation constraints for the VM. So, my two questions:

1) Are there limitations on how many ping-based locations can be declared?
2) Is this (one vm = one ping location) the best practice for monitoring the connectivity of the nodes?

Thanks for your help,

-- RaSca
Re: [Linux-HA] Retransmit list and window_size
On Fri, 05 Apr 2013 15:29:36 CEST, RaSca wrote: [...] It seems that when a configuration message has to run over the ring, in some particular cases, everything collapses. Following Florian's article I've tried setting a window_size of 300, but since everything stays the same, I think that with a default netmtu of 1500, following the corosync man page, I must not go over 170 (which is 256000/1500). The point is: what else can I check? Does it make sense to set a window_size LOWER than 50? Thanks for your help,

I answer myself; maybe it will be useful for someone else. There was no way to make multicast work in this network. It does not depend on the window_size or other parameters: sometimes it just breaks. Even though multicast tested successfully (with omping and also mnc), sometimes the ring does not complete and I get the retransmit-list messages that make the cluster crash. The only solution I've found is to use unicast, declaring transport: udpu in corosync.conf and a member section for each node in the cluster. Doing this brought everything back up. I still see the retransmit-list messages, but they are on the order of one per hour, so it's fine. Have you got other suggestions?

-- RaSca
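A sketch of the relevant corosync.conf (1.x) changes for the unicast setup described above; addresses are placeholders, and there must be one member block per cluster node:

totem {
    version: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.0.2.0
        member {
            memberaddr: 192.0.2.11
        }
        member {
            memberaddr: 192.0.2.12
        }
        # ...and so on for the remaining nodes
    }
}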
Re: [Linux-HA] Resource move not moving
On Tue, 16 Apr 2013 15:50:07 CEST, Marcus Bointon wrote: I'm running crm using heartbeat 3.0.5 / pacemaker 1.1.6 on Ubuntu Lucid 64. [...] So if all that's true, why is that resource group still on the original node? Is there something else I need to do? Marcus

Try using crm_resource with -f, to force the operation.

-- RaSca
[Linux-HA] Retransmit list and window_size
Hi there, in one of my clusters I still have problems with retransmit-list messages. The problem is not reproducible: sometimes, while the cluster is changing its state (for example when migrating a vm from one node to another), it starts emitting retransmit-list messages and, in the worst case, it loses quorum. I followed what Florian wrote here: http://www.hastexo.com/resources/hints-and-kinks/whats-totem-retransmit-list-all-about-corosync but I still have some doubts. I'm sure that this 9-node cluster is composed of identical machines, and I'm quite sure that network multicast has no problems, even though the nodes are distributed across different enclosures. I say "quite" because I've done some tests with tools like MNC and the connection seems to be fine, not losing anything. It seems that when a configuration message has to run over the ring, in some particular cases, everything collapses. Following Florian's article I've tried setting a window_size of 300, but since everything stays the same, I think that with a default netmtu of 1500, following the corosync man page, I must not go over 170 (which is 256000/1500). The point is: what else can I check? Does it make sense to set a window_size LOWER than 50? Thanks for your help,

-- RaSca
Re: [Linux-HA] Antw: Re: Problem with exportfs
On Wed, 13 Feb 2013 17:36:13 CET, Dejan Muhamedagic wrote: Hi, On Wed, Feb 13, 2013 at 02:03:16PM +0100, Ulrich Windl wrote: Hi! I've made a patch to let exportfs propagate the errors it reports to the exit code of the process (see attachments; the compressed tar is there in case the mailer corrupts the patch files): You won't get the right audience here for exportfs (the program). I'm not sure where the NFS stuff is discussed, but there's probably a public forum somewhere. Thanks, Dejan [...]

There's an NFS mailing list, here: linux-...@vger.kernel.org. It's the place where I asked (three years ago) about exportfs (http://en.usenet.digipedia.org/thread/18978/8062/). Bye,

-- RaSca
Re: [Linux-HA] Problems with quorum, no-quorum-policy and NMI messages
On Tue, 16 Oct 2012 23:44:15 CEST, Lars Marowsky-Bree wrote: [...] Depending on what kind of problem this node has, it could be that it erratically affects the timing of network messages, or even sends garbage, which has the potential to mess up the totem protocol pretty badly. What corosync version do you have? And yes, this is impossible to diagnose without the full cluster logs etc. A good candidate for bugzilla. Regards, Lars

Hi Lars, thank you for your answer. I know that a coherent analysis is impossible without the full logs, but as you can imagine there are a lot of logs about this problem and yes, I will file a bugzilla as soon as possible. Some more information about the systems:

OS version: CentOS release 6.2 (Final)
Kernel version: 2.6.32-220.23.1.el6.x86_64
Corosync version: corosync-1.4.1-4.el6_2.3.x86_64

Digging deeper into the failed node I also saw this message:

ERST: Can not request iomem region 0x88103419be60-0x102068337cc0 for ERST.

From the Red Hat Knowledge Base it seems that the root cause is a kernel problem with ERST (Error Record Serialization Table) access. The suggested resolution is to upgrade to kernel version 2.6.32-279.el6. I just need to know whether this error is a consequence of the original one (NMI) or its cause. What I know is that it appeared after the NMI error so, maybe, it is a consequence. As I said, I will file a bugzilla soon. Thanks again,

-- RaSca
Re: [Linux-HA] Time based resource stickiness example with crm configure ?
On Thu, 30 Aug 2012 14:53:45 CEST, Stefan Schloesser wrote: Hi, I would like to set the resource-stickiness to 0 on Tuesdays between 2:00 and 2:20 am local time. I could not find any examples of how to do this using crm configure, only the XML snippets that accomplish it. Could someone point me to the documentation or give me an example of the syntax? Thanks, Stefan

Did you take a look at Pacemaker Explained? In any case, you can configure a cron job that runs at the time you want and launches crm configure property default-resource-stickiness=0.

-- RaSca
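A sketch of the cron approach suggested above (the restore value of 100 is an assumption; use whatever stickiness your cluster normally runs with, and note that newer versions prefer rsc_defaults resource-stickiness over the legacy property):

# crontab on one cluster node: drop stickiness at 2:00 on Tuesdays, restore at 2:20
0  2 * * 2  crm configure property default-resource-stickiness=0
20 2 * * 2  crm configure property default-resource-stickiness=100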
[Linux-HA] Best way to know on which host a resource has failed and where it will be promoted
Hi all, I want to interact with the election of a new master. I don't know whether I must operate at the Resource Agent level or at the cluster level, so I'm open to suggestions. Suppose I've got a multi-state resource with one master and two (or more) slaves. Suppose then that the master fails. At this point, and BEFORE the new master is elected, I need to run a software check (based upon the name of the failed host) that returns the best new master. I want to pass this host to the cluster so that it can promote it. Is there a way to configure this kind of script at the cluster level (as we do with locations), or must I interact with the resource agent somewhere in the notify part (for example, dynamically assigning different weights to the nodes)? Thank you all for any suggestion.

-- Raoul Scarazzini
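One mechanism that fits the "different weights" idea at the resource-agent level is crm_master, which sets the per-node master preference that the cluster consults at promotion time. A sketch (the score values are arbitrary examples):

# inside the RA, e.g. from the monitor or notify action, run on each node:
crm_master -l reboot -v 100   # this node is the preferred new master
crm_master -l reboot -v 10    # this node is a fallback candidate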
Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource
On Wed, 22 Aug 2012 09:00:55 CEST, Ulrich Windl wrote: [...] Hi! Amazingly, the primary key (=ID) of the monitor operations is built using the interval, not the role. So if you have two monitor operations with the same interval, you have a resource conflict. It's documented, although it's a sick concept... Decide whose fault it is... yours or the CRM's... Regards, Ulrich

Thank you Ulrich. As far as you know, is there a way to override the ID for each cloned instance of the mysql resource? How can I resolve the problem?

-- RaSca
Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource
On Wed, 22 Aug 2012 10:11:52 CEST, Lars Marowsky-Bree wrote: Just make the intervals slightly different - 31s, 30s, 29s ... Regards, Lars

Thank you Lars. In fact, this is what I've done and now everything is ok. But I want to understand one last thing: if the ID is derived from the value of the interval, then why don't I get errors even though I've got two slaves, which means two identical intervals? I hope I have made myself clear. Thanks a lot,

-- RaSca
[Linux-HA] Duplicate monitor operation on a multi state resource
Hi all, I'm trying to use the mysql resource agent to manage a setup with one master and two slaves. This is the configuration of the mysql resource and the master/slave one:

primitive resMySQL ocf:custom:mysql \
    params binary=/usr/bin/mysqld_safe config=/etc/my.cnf datadir=/var/lib/mysql \
        user=mysql replication_user=myuser replication_passwd=mypassword \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=120 \
    op promote interval=0 timeout=120 \
    op demote interval=0 timeout=120 \
    op monitor interval=10 role=Master timeout=30 \
    op monitor interval=10 role=Slave timeout=30
ms ms_resMySQL resMySQL \
    meta master-max=1 master-node-max=1 clone-node-max=1 clone-max=3 notify=true globally-unique=false

The problem is that I see errors like these in the logs, repeated continuously:

Aug 21 15:24:53 domU-12-31-39-0C-1A-2B pengine: [3816]: ERROR: is_op_dup: Operation resMySQL-monitor-10-0 is a duplicate of resMySQL-monitor-10
Aug 21 15:24:53 domU-12-31-39-0C-1A-2B pengine: [3816]: ERROR: is_op_dup: Do not use the same (name, interval) combination more than once per resource

and in fact, even if I manually kill the process on a node, the cluster isn't aware of it and does not react. What is wrong with this ms resource?

-- RaSca
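As Lars points out in the reply above, the fix is simply to give the two role-specific monitors different intervals. A sketch of the corrected operations:

    op monitor interval=10 role=Master timeout=30 \
    op monitor interval=11 role=Slave timeout=30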
Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?
On Wed, 06 Jun 2012 23:03:49 CEST, Lars Ellenberg wrote: [...] Two globally-unique clones I came across in real life: cluster IP buckets, in the sense of the iptables CLUSTERIP target, and sequences of IPs generated by the IPaddr2 resource, where the clone id is added to the base IP. Both will also need to allow clone-node-max > 1, and one node will host more than one clone instance in the failover case.

Thank you Lars, now everything is clearer. Andrew: how about putting another example of this kind of resource into the Pacemaker Explained docs? I mean extending the Clones chapter (here: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-clone.html) with an example of a globally-unique resource like the ones described by Lars. The example that is there now is of an anonymous clone, so I think it would be useful to have that box under the "anonymous" description and a globally-unique example below the "globally unique" description. For stateful resources there is no problem, since they have a chapter of their own. Lars, could you please provide a sample xml for one of the solutions you suggested (either one is fine)? I can modify the docs myself and then submit the patch to Andrew.

-- RaSca
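A sketch of the first kind of clone Lars describes (an IPaddr2 clone of CLUSTERIP buckets, along the lines of the example in Clusters from Scratch; the address is a placeholder):

primitive res_ClusterIP ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.100 cidr_netmask=24 clusterip_hash=sourceip \
    op monitor interval=30s
clone cl_ClusterIP res_ClusterIP \
    meta globally-unique=true clone-max=2 clone-node-max=2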
[Linux-HA] Does globally-unique make sense on filesystems cloned resources?
Hi all, I've configured an NFS share which is cloned on each node of my cluster. What I need to understand is how the globally-unique parameter applies to this situation. Starting from its definition: "Globally unique clones are distinct entities. A copy of the clone running on one machine is not equivalent to another instance on another node. Nor would any two copies on the same node be equivalent." How can this be applied to filesystem resources? In addition, I've set this parameter to true, since my filesystems are identical on each node, but does this make sense? Thanks a lot,

-- RaSca
Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?
On Wed, 06 Jun 2012 16:53:28 CEST, Florian Haas wrote: [...] Nope. :) Quite the contrary. It's the same filesystem you're mounting everywhere. That's a relatively classic anonymous clone. The globally-unique=false default should apply here. Cheers, Florian

Thank you Florian, but how does one declare an anonymous clone? Is it implicit with globally-unique=false?

-- RaSca
Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?
On Wed, 06 Jun 2012 17:35:03 CEST, Lars Marowsky-Bree wrote: On 2012-06-06T17:26:41, RaSca ra...@miamammausalinux.org wrote: Thank you Florian, but how can one declare an anonymous clone? Is it implicit with globally-unique=false? You don't need to declare that explicitly. It is the default. (And yes, the default is globally-unique=false.) Regards, Lars

Just for completeness: could you please mention a resource that might be globally-unique? Thanks,

-- RaSca
Re: [Linux-HA] Heartbeat question about multiple services
On Fri, 20 Apr 2012 12:42:16 CEST, sgm wrote: Hi, I have a question about heartbeat. If I have three services (apache, mysql and sendmail) and apache goes down, heartbeat will switch all the services to the standby server, right? If so, how do I configure heartbeat to avoid this? Very appreciated. gm

You may want to start from here: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/

-- RaSca
Re: [Linux-HA] Can a HA cluster be built with nodes in different VLANs?
On Wed, 08 Feb 2012 02:51:40 CET, Ryan Stepalavich wrote: Good evening, I'm currently attempting to build a LAMP high-availability cluster on Ubuntu 11.10. The trick is that each node is in a different VLAN. This causes Heartbeat to die when trying to fail over the hosted IP into an invalid VLAN. Site 1 VLAN: 10.204.200.0/24 Site 2 VLAN: 10.204.202.0/24 Is there a way around this issue? Thanks!

The only way out is to put a load balancer in front of everything; it will balance the load between the two different LANs. You can do this with hardware load balancers and also with Linux LVS (i.e. ldirectord).

-- RaSca
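A minimal ldirectord.cf sketch of the idea (all addresses and the check page are hypothetical; with real servers in different VLANs you would typically use masquerading or tunnelling rather than direct routing):

checktimeout=10
checkinterval=5
virtual=198.51.100.10:80
        real=10.204.200.10:80 masq
        real=10.204.202.10:80 masq
        service=http
        request="index.html"
        receive="OK"
        scheduler=rr
        protocol=tcp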
Re: [Linux-HA] Pacemaker : how to modify configuration ?
On Mon, 28 Nov 2011 15:04:45 CET, alain.mou...@bull.net wrote: Hi, sorry, but I forget: is there another way than crm configure edit to modify the value of on-fail= for all resources in the configuration? Thanks, Alain

Why not use :%s/on-fail=.*/on-fail=newvalue/g in crm configure edit?

-- RaSca
Re: [Linux-HA] Pacemaker : how to modify configuration ?
On Tue, 29 Nov 2011 09:30:41 CET, alain.mou...@bull.net wrote: Hi, yes, I know it is possible this way, but I don't like telling anybody to use crm configure edit, because it is a slightly risky command (risk of corrupting the file). When I'm the person who operates, I often use crm configure edit, but I'm a little reluctant to tell somebody who is not really a Pacemaker specialist to use this command. So I'd prefer a command with cibadmin/grep/sed, as Andrew suggested. Thanks, Alain

Consider that a bad configuration will not be processed by the crm editor. In addition, it is possible to dump the current configuration before making any modifications. That said: if you're reluctant to let non-specialist users modify the configuration, then why let them modify delicate parameters such as on-fail at all?

-- RaSca
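A sketch of the non-interactive cibadmin/sed approach mentioned here (back up the CIB first; the on-fail value is an example):

cibadmin --query > /tmp/cib.backup.xml
cibadmin --query | sed 's/on-fail="[^"]*"/on-fail="restart"/g' > /tmp/cib.new.xml
cibadmin --replace --xml-file /tmp/cib.new.xml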
Re: [Linux-HA] LCMC NumberFormatException
On Thu, 13 Oct 2011 09:28:33 CEST, RaSca wrote: On Wed, 12 Oct 2011 22:22:15 CEST, Rasto Levrinc wrote: [...] Well, LCMC doesn't handle "k" in this type of field, as Caspar said. It will be fixed. As a workaround you can set it to 131072. Rasto Clear, thanks.

With LCMC 1.0.2 the problem is solved. Thank you Rasto!

-- RaSca
Re: [Linux-HA] LCMC NumberFormatException
On Wed, 12 Oct 2011 22:22:15 CEST, Rasto Levrinc wrote: [...] Well, LCMC doesn't handle "k" in this type of field, as Caspar said. It will be fixed. As a workaround you can set it to 131072. Rasto

Clear, thanks.

-- RaSca
[Linux-HA] LCMC NumberFormatException
Hi all, I'm facing this problem while connecting to one of my pacemaker clusters with LCMC (also known as DMC):

AppError.Text release: 1.0.1
java: Sun Microsystems Inc. 1.6.0_26
uncaught exception
java.lang.NumberFormatException: For input string: 128k
For input string: 128k
java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
java.lang.Long.parseLong(Long.java:419)
java.lang.Long.parseLong(Long.java:468)
lcmc.data.DrbdXML.checkParam(DrbdXML.java:419)
lcmc.gui.resources.DrbdResourceInfo.checkParam(DrbdResourceInfo.java:238)
lcmc.gui.resources.EditableInfo.checkResourceFieldsCorrect(EditableInfo.java:957)
lcmc.gui.resources.DrbdResourceInfo.checkResourceFieldsCorrect(DrbdResourceInfo.java:792)
lcmc.gui.resources.DrbdResourceInfo.checkResourceFieldsCorrect(DrbdResourceInfo.java:728)
lcmc.gui.resources.EditableInfo$7.run(EditableInfo.java:601)
java.lang.Thread.run(Thread.java:662)

What could the problem be? Please tell me also if this message is off topic for the list...

-- RaSca
Re: [Linux-HA] LCMC NumberFormatException
On Wed, 12 Oct 2011 15:09:24 CEST, Rasto Levrinc wrote: On Wed, Oct 12, 2011 at 2:49 PM, Caspar Smit c.s...@truebit.nl wrote: Hi Rasca, probably the sndbuf-size (or some other variable) in drbd.conf is set to 128K. That shouldn't be a problem, normally. Rasca, what parameters are set to 128k and what DRBD version(s) do you have? Maybe 128K would work. Rasto

max-buffers 128k; in global.conf. The DRBD version is 8.3.10. Thanks,

-- RaSca
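In drbd.conf terms, the workaround from this thread (131072 is the plain-number equivalent of 128k, which LCMC 1.0.1 cannot parse):

net {
    max-buffers 131072;
}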
Re: [Linux-HA] Understanding why a host fence (was: Resource fail and node fence)
On Tue, 27 Sep 2011 06:28:00 CEST, Andrew Beekhof wrote: [...] /stop/ failed; your on-fail setting only applies to the /monitor/ operation

Yes Andrew, now it's absolutely clear. Increasing the timeout for the migration to 240s and setting on-fail=block for the stop operation solved my problems. Even though the timeouts alone probably did the trick, also setting on-fail gives me more control over errors during migrations, since a single vm migration failure no longer fences an entire node. Thanks,

-- RaSca
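Putting the two changes together with the VirtualDomain primitive shown later in this thread gives roughly the following (a sketch; the timeout values are the ones discussed here):

primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
    params config=/etc/libvirt/qemu/vm-1.xml hypervisor=qemu:///system \
        migration_transport=ssh force_stop=true \
    meta allow-migrate=true \
    op monitor interval=10s timeout=30s on-fail=restart depth=0 \
    op start interval=0 timeout=120s \
    op stop interval=0 timeout=120s on-fail=block \
    op migrate_to interval=0 timeout=240s \
    op migrate_from interval=0 timeout=240s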
Re: [Linux-HA] Resource fail and node fence
On Tue, 20 Sep 2011 17:54:58 CEST, Dejan Muhamedagic wrote: [...] And I completely agree with this, but in an environment like mine, where a single resource failure might involve all the others (with fencing), it is wrong to keep this kind of setting. Do you agree with me? No. If the resource cannot stop, then something's wrong either with the resource or with the RA. And it needs to be fixed.

And this is for sure. But I cannot make all the resources on a node stop and migrate (because of fencing) just because one of them has failed. Those resources are not connected to one another, so it is more reasonable to keep the surviving resources alive, or at least live-migrate them (on-fail=standby) to the other node and THEN reboot the first one.

[...] If a resource fails to stop, then on-fail=stop cannot possibly help. Furthermore, you basically make this resource less available (the cluster won't try to recover it). I must be missing something. At any rate, I don't think that you need to fiddle with the on-fail attribute; rather, see what's wrong with the RA or libvirt or the combination of the two. Thanks, Dejan

Then my question is: why was the on-fail attribute created? I repeat, I totally agree that there are some problems with the RA, but until I find out exactly what's wrong I have to take care of all my cluster's resources, and so the most reasonable thing is to keep a single failed vm stopped. Thanks,

-- RaSca
Re: [Linux-ha-dev] Issue with VirtualDomain
On Mon, 19 Sep 2011 21:20:12 CEST, Michael Schwartzkopff wrote: On Mon, 19 Sep 2011 15:17:30 CEST, Michael Schwartzkopff wrote: [...] What transport do you use? Which version of libvirt? Transport ssh; the libvirt version is 0.9.2-7 (the squeeze-backports version). I get an error: migration job: unexpectedly failed. Only if I add the above-mentioned options to the migration does it work.

Hi Michael, does this happen *every time* you try to migrate the resource? I'm facing a strange problem with the VirtualDomain RA and node fencing (see my post on the linux-ha ML, "Understanding why a host fence (was: Resource fail and node fence)") and I'm trying to understand where the problem is. Have you tried using another transport, such as tcp over sasl, or tls?

-- RaSca
Re: [Linux-ha-dev] Issue with VirtualDomain
On Tue, 20 Sep 2011 11:11:58 CEST, Michael Schwartzkopff wrote: [...] Yes. This happens every time. Also when I use the virsh command line to migrate the virtual machine. So the problem is inside libvirt.

But are you using the same libvirt version as mine (0.9.2-7)?

Your problem. Sorry, I have never seen this. I had several other issues with VirtualDomain. Please see my patches to the RA: http://www.gossamer-threads.com/lists/linuxha/dev/74103

I saw your patch, and it might be very useful. My problem is exactly on stop: when the error comes up, the vm is in state "paused", and it seems to be this that makes things break.

First tests with tcp show that it also needs the --p2p --tunnelled options. I did not try sasl. Greetings,

Very strange. I think that the problem may be somewhere else (network connection, collisions, physical stuff)...

-- RaSca
Re: [Linux-ha-dev] Issue with VirtualDomain
On Tue, 20 Sep 2011 12:27:20 CEST, Michael Schwartzkopff wrote: [...] First tests with tcp show that it also needs the --p2p --tunnelled options. I did not try sasl. Greetings, Very strange. I think that the problem may be somewhere else (network connection, collisions, physical stuff)... No. See man virsh. I also have 0.9.2-7~bpo60+1.

The man page only says that --p2p is peer-to-peer migration and --tunnelled is tunnelled migration. And I agree with this :-) From what I've found:

- tunnelled migration keeps the migration job tracked in the background, so that libvirt is able to cancel a migration when a problem happens;
- peer2peer makes the source libvirtd server connect directly to the destination libvirtd server (peer-to-peer), setting up a secure channel.

In the first case I may agree that the parameter could help (even if only on failures), but for p2p... I can't say. Anyway, it doesn't explain why this version works correctly for me. If you want, I can send you my crm and libvirt configurations so we can do some comparisons.

-- RaSca
Re: [Linux-ha-dev] Issue with VirtualDomain
On Tue, 20 Sep 2011 13:45:30 CEST, Michael Schwartzkopff wrote: [...] the cluster is completely irrelevant here. The plain command virsh migrate guest qemu+ssh://other_node/system doesn't work here. I need the options --p2p --tunnelled. So this is a libvirt issue, not a cluster issue. But if these parameters are really needed, VirtualDomain has to be patched. Greetings,

This is for sure. Anyway, this is my libvirtd.conf:

listen_tls = 0
unix_sock_group = "libvirt"
unix_sock_rw_perms = "0770"
auth_unix_ro = "none"
auth_unix_rw = "none"

I haven't changed anything else.

-- RaSca
[Linux-HA] Understanding why a host fence (was: Resource fail and node fence)
Hi all, I'm starting a new thread because I've got more debug details for analyzing my situation, and starting from the beginning might be better. My environment is composed of two machines, connected to a network and to each other. The cluster runs a lot of virtual machines, each one based upon a dual-primary DRBD device. The two systems are Debian Squeeze with backports:

kernel 2.6.39-3
drbd 8.3.10-1
corosync 1.3.0-3
pacemaker 1.0.11-1
libvirt-bin 0.9.2-7

The (dual-primary) drbd resources are declared in this way:

primitive vm-1_r0 ocf:linbit:drbd \
    params drbd_resource=r0 \
    op monitor interval=20s role=Master timeout=20s \
    op monitor interval=30s role=Slave timeout=20s \
    op start interval=0 timeout=240s \
    op stop interval=0 timeout=100s
ms vm-1_ms-r0 vm-1_r0 \
    meta notify=true master-max=2 clone-max=2 interleave=true

and the virtual machines are like this:

primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
    params config=/etc/libvirt/qemu/vm-1.xml hypervisor=qemu:///system migration_transport=ssh force_stop=true \
    meta allow-migrate=true \
    op monitor interval=10s timeout=30s on-fail=restart depth=0 \
    op start interval=0 timeout=120s \
    op stop interval=0 timeout=120s

There are colocation and order constraints for each vm:

colocation vm-1_ON_vm-1_ms-r0 inf: vm-1 vm-1_ms-r0:Master
order vm-1_AFTER_vm-1_ms-r0 inf: vm-1_ms-r0:promote vm-1:start

And there is a location constraint for connectivity:

location vm-1_ON_CONNECTED_NODE vm-1 \
    rule $id="vm-1_ON_CONNECTED_NODE-rule" -inf: not_defined ping or ping lte 0

The problem is that every night I've scheduled a live migration of a vm, but if this fails, then the node gets fenced, even though the on-fail parameter of the vm is set to restart. Everything starts at 23:00:

Sep 19 23:00:01 node-2 crm_resource: [8947]: info: Invoked: crm_resource -M -r vm-1

Two seconds later, the first problem:

Sep 19 23:00:02 node-2 lrmd: [2145]: info: cancel_op: operation monitor[171] on ocf::VirtualDomain::vm-1_virtualdomain for client 2148, its parameters: hypervisor=[qemu:///system] CRM_meta_depth=[0] CRM_meta_timeout=[3] force_stop=[true] config=[/etc/libvirt/qemu/vm-1.lan.mmul.local.xml] depth=[0] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] migration_transport=[ssh] CRM_meta_interval=[1] cancelled

Why is this operation marked as cancelled? Anyway, after 22 seconds, the operation fails with Timed Out:

Sep 19 23:00:22 node-2 crmd: [2148]: ERROR: process_lrm_event: LRM operation vm-1_virtualdomain_migrate_to_0 (236) Timed Out (timeout=2ms)

Forced shutdown is invoked:

Sep 19 23:00:22 node-2 VirtualDomain[9256]: INFO: Issuing forced shutdown (destroy) request for domain vm-1.

and even though the vm appears to be destroyed (the kernel messages confirm that the vmnet devices were destroyed), the RA seems to ignore it:

Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: (vm-1_virtualdomain:stop:stderr) error: Failed to destroy domain vm-1
Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: (vm-1_virtualdomain:stop:stderr) error: Requested operation is not valid: domain is not running
Sep 19 23:00:22 node-2 crmd: [2148]: info: process_lrm_event: LRM operation vm-1_virtualdomain_stop_0 (call=237, rc=1, cib-update=445, confirmed=true) unknown error

Meanwhile, on the other node, some errors are detected:

Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: Migrating vm-1_virtualdomain from node-2 to node-1
Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: Repairing vm-1_ON_vm-1_ms-r5: vm-1_virtualdomain == vm-1_ms-r5 (100)
...
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing failed op vm-1_virtualdomain_monitor_0 on node-2: unknown exec error (-2)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing failed op vm-1_virtualdomain_stop_0 on node-2: unknown error (1)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: pe_fence_node: Node node-2 will be fenced to recover from resource failure(s)

a STONITH is invoked...

Sep 19 23:00:23 node-1 stonithd: [2309]: info: client tengine [pid: 2314] requests a STONITH operation RESET on node node-2

...with success:

Sep 19 23:00:24 node-1 stonithd: [2309]: info: Succeeded to STONITH the node node-2: optype=RESET. whodoit: node-1

My conclusions are:

1 - the fencing has nothing to do with drbd (there is no mention of it until the reset is done);
2 - for some reason, live-migrating the vms SOMETIMES fails, even though once the system has recovered I can do a crm resource move vm-1 without ANY problem;
3 - even if the vm fails to stop, the cluster does not try to restart it, but simply fences the node, and this is not what the on-fail parameter is meant to do.

Does someone have suggestions on how to debug this problem further? Please help! Thanks a lot,

-- RaSca
Re: [Linux-HA] Resource fail and node fence
On Tue, 20 Sep 2011 12:53:40 CEST, Dejan Muhamedagic wrote: [...] on-fail is a per-operation attribute. By default, it is set to fence only for the stop operation. The point is that a failed stop means the cluster can no longer establish the state of the resource, so the only remedy remaining is to fence the node. Thanks, Dejan

And I completely agree with this, but in an environment like mine, where a single resource failure might involve all the others (with fencing), it is wrong to keep this kind of setting. Do you agree with me? It makes more sense to have the resource stopped or unmanaged instead of fencing the node. So, even if I had to patch the VirtualDomain RA with Michael's patch (http://www.gossamer-threads.com/lists/linuxha/dev/74103), it may be useful for me to set the stop operation's on-fail parameter to stop, or maybe block, like this:

op monitor interval=10 timeout=30 start-delay=0 on-fail=stop

Am I right?

-- RaSca
Re: [Linux-ha-dev] Issue with VirtualDomain
On Mon, 19 Sep 2011 12:17:30 CEST, Michael Schwartzkopff wrote: Hi, I tested migration with a recent version of libvirt, essentially the one from squeeze-backports. The transport qemu+ssh://other_node/system now needs the additional parameters --p2p --tunnelled on the migration command line. So it should be:

virsh migrate guest --p2p --tunnelled qemu+ssh://other_node/system

I will try to write a patch and publish it here. Greetings,

Hi Michael, I'm using Debian Squeeze with Pacemaker and libvirt from backports, but I didn't have to apply any patch to make live migration work. Have you got a specific setup?

-- RaSca
Re: [Linux-ha-dev] Issue with VirtualDomain
On Mon, 19 Sep 2011 15:17:30 CEST, Michael Schwartzkopff wrote: [...] What transport do you use? Which version of libvirt?

Transport ssh; the libvirt version is 0.9.2-7 (the squeeze-backports version).

-- RaSca
Re: [Linux-HA] Resource fail and node fence
On Mon, 19 Sep 2011 17:55:58 CEST, Dejan Muhamedagic wrote: Hi, On Wed, Sep 14, 2011 at 09:43:43AM +0200, RaSca wrote: Hi all, I've got a two-node pacemaker/corosync cluster with some virtual domain resources on some DRBD devices. Every DRBD device is configured in a dual-primary setup and I have enabled live migration. The cluster also has stonith enabled. My problem is that if a live migration for a single virtualdomain resource fails, then this node gets fenced, making unavailable also all AFAIK, failing migration shouldn't result in node fence. I guess that actually the subsequent stop operation failed, right? In that case, that's probably a bug somewhere in the RA or VM code. Thanks, Dejan

Hi Dejan, thanks as usual for your response. In the end, since I was facing too many unexplainable problems, I decided to upgrade libvirt and the kernel itself to a newer version (from squeeze to squeeze-backports). So far the problems seem to be resolved. In Pacemaker Explained (Andrew, I'm almost finished with the translation, I swear!) it is written that the default on-fail action is fence, so it is assumed that if a single resource fails, the entire node is fenced. Note that at the moment every one of my virtualdomain resources has the on-fail action set to restart, and I've not seen any fencing. But please help me understand this: what do you mean by "subsequent stop operation"? It is very plausible that this was the reason, since the failed virtual machines were in state "paused" even though I was forcing the stop. Is this enough to fence a node? Why is this failure not considered in the on-fail parameter declaration? Have I made myself clear? Thanks a lot,

-- RaSca
[Linux-HA] Resource fail and node fence
Hi all, I've got a two-node pacemaker/corosync cluster with some virtual domain resources on some DRBD devices. Every DRBD device is configured in a dual-primary setup and I have enabled live migration. The cluster also has stonith enabled. My problem is that if a live migration for a single virtualdomain resource fails, then the node gets fenced, which also makes all the other virtual machines unavailable (they get restarted on the other node after a poweroff). As far as I can see, the way to keep a single resource failure from fencing the node where it fails is to declare an on-fail=restart option for the virtual domain. Is that the correct approach, or is there a more elegant way to obtain what I want? Thanks to all,

-- RaSca
Re: [Linux-HA] [Linux HA] Problem with Apache
On Wed, 08 Jun 2011 11:26:01 CET, Alfredo Parisi wrote: Hi and thanks for the response. If I open my browser with my local ip, I see that the web server doesn't work. Which log do you want, apache2 or corosync? Sorry, but I'm a newbie on Linux HA. Thanks. UPDATE: I've checked again; initially I had removed apache from the boot sequence, but if I start the apache service it works on both nodes, with both local IPs and with the virtual IP. So how can I install Drupal on my cluster?

Whoa. There are a lot of things you need to clarify for yourself. First of all, forget about the CMS for now and concentrate on putting the Apache service into the cluster. As I said, you need to check why Apache does not work in the cluster configuration, and the reason is in the logs. After that, installing the CMS will be trivial.

-- RaSca
Re: [Linux-HA] [Linux HA] Problem with Apache
On Wed, 08 Jun 2011 11:53:54 CET, Alfredo Parisi wrote: Ok, thank you. Now apache works on the virtual ip (10.10.7.100). With crm_mon I have no errors; this is my situation: [...] Now can I install the CMS? Thanks

I hope so :-)

-- RaSca
Re: [Linux-HA] What means this type of errors ?
On Tue, 07 Jun 2011 11:16:56 CET, alain.mou...@bull.net wrote: Sorry, some mistakes in the previous logs; here are the real ones: [...]

You need to look for your problem BEFORE these logs. The problem is with a Filesystem resource, so you need to search for errors concerning that resource by looking for its name BEFORE all those messages that show how it was moved away from a node.

-- RaSca
Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent
On Thu, 26 May 2011 23:01:48 CET, Lars Ellenberg wrote: [...] Would that not be a power on? Alright, it seems to just change the boot settings, not boot per se. Oh well...

Yes, as you saw, it is not a kind of boot mode, but just a setting. Note, in addition, that there is also a wake-on-lan function that could be used, but there is no power off function, so some kind of power on would not be useful. Thanks for all your suggestions.

-- RaSca
Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent
On Fri, 27 May 2011 14:41:32 CET, Dejan Muhamedagic wrote: Hi, [...] # Can't really be implemented because Hetzner webservice cannot power on a system Replace comments with ha_log.sh calls.

Hi Dejan, everything else is clear and I've already updated the script, but what do you mean by this? What's the matter with the comments? The only way I found to call ha_log.sh is with err, warn, info or debug. How can comments be part of this?

-- RaSca
Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent
Il giorno Ven 27 Mag 2011 15:02:11 CET, RaSca ha scritto: Il giorno Ven 27 Mag 2011 14:41:32 CET, Dejan Muhamedagic ha scritto: Hi, [...] # Can't really be implemented because Hetzner webservice cannot power on a system Replace comments with ha_log.sh calls. Hi Dej, everything else is clear and I've already updated the script, but what do you mean by this? What's the matter with the comments? The only way I found to call ha_log.sh is with err, warn, info or debug. How can comments be part of this?

Wait, I think I understand. Do you mean something like:

ha_log.sh warn "Can't really be implemented because Hetzner webservice cannot power on a system"

Am I right?

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
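For reference, a hedged before/after sketch of what that review comment asks for (the message text is illustrative):

# before: a comment only developers reading the source will see
# Can't really be implemented because Hetzner webservice cannot power on a system
exit 1

# after: the same information lands in the cluster log
ha_log.sh err "on: not implemented, the Hetzner webservice cannot power on a system"
exit 1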
Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent
Il giorno Ven 27 Mag 2011 15:33:24 CET, Dejan Muhamedagic ha scritto: [...] Yes. It's because users cannot read comments that easily, they usually look at the logs. If at all. Cheers,

New (and hopefully last) version attached. Bye,

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org

#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
# Read parameters from config file, format is based upon the hetzner OCF resource agent
# developed by Kumina: http://blog.kumina.nl/2011/02/hetzner-failover-ip-ocf-script/
conf_file=/etc/hetzner.cfg
user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`
pass=`sed -n 's/^pass.*=\ *//p' /etc/hetzner.cfg`
hetzner_server="https://robot-ws.your-server.de"

check_http_response() {
  # If the response is 200 then return 0
  if [ "$1" = 200 ]
  then
    return 0
  else
    # If the response is not 200 then log a description of the problem and return 1
    case $1 in
      400) ha_log.sh err "INVALID_INPUT - Invalid input parameters" ;;
      404) ha_log.sh err "SERVER_NOT_FOUND - Server with ip $remote_ip not found" ;;
      409) ha_log.sh err "RESET_MANUAL_ACTIVE - There is already a running manual reset" ;;
      500) ha_log.sh err "RESET_FAILED - Resetting failed due to an internal error" ;;
    esac
    return 1
  fi
}

case $1 in
  gethosts)
    echo $hostname
    exit 0
    ;;
  on)
    # Can't really be implemented because Hetzner's webservice cannot power on a system
    ha_log.sh err "Power on is not available since Hetzner's webservice can't do this operation."
    exit 1
    ;;
  off)
    # Can't really be implemented because Hetzner's webservice cannot power off a system
    ha_log.sh err "Power off is not available since Hetzner's webservice can't do this operation."
    exit 1
    ;;
  reset)
    # Launching the reset action via webservice
    check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw)
    exit $?
    ;;
  status)
    # Check if we can contact the webservice
    check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/server/$remote_ip)
    exit $?
    ;;
  getconfignames)
    echo hostname
    echo remote_ip
    exit 0
    ;;
  getinfo-devid)
    echo "Hetzner STONITH device"
    exit 0
    ;;
  getinfo-devname)
    echo "Hetzner STONITH external device"
    exit 0
    ;;
  getinfo-devdescr)
    echo "Hetzner host reset"
    echo "Manages the remote webservice for reset a remote server."
    exit 0
    ;;
  getinfo-devurl)
    echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
    exit 0
    ;;
  getinfo-xml)
    cat << HETZNERXML
<parameters>
<parameter name="hostname" unique="1">
<content type="string" />
<shortdesc lang="en">Hostname</shortdesc>
<longdesc lang="en">The name of the host to be managed by this STONITH device.</longdesc>
</parameter>
<parameter name="remote_ip" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">Remote IP</shortdesc>
<longdesc lang="en">The address of the remote IP that manages this server.</longdesc>
</parameter>
</parameters>
HETZNERXML
    exit 0
    ;;
  *)
    ha_log.sh err "Don't know what to do for '$remote_ip'"
    exit 1
    ;;
esac

___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent
Il giorno Mar 24 Mag 2011 16:10:17 CET, Lars Ellenberg ha scritto: [...] exit $((! $?)) That is going to invert the code. The shell has ! for that. ! is_host_up Coffee? ;-)

Ok, following your suggestions I've modified (and tested, of course) the script, compacting as much as I can. But Lars, sorry, I didn't find out how to compact this:

is_host_up $remote_ip
exit $((! $?))

into this:

exit ! is_host_up $remote_ip

:-( Anyway, now I've got a deeper problem: I was totally misunderstanding what the status field of the curl interrogation was meant for. So I corrected the is_host_up function, making it check (similarly to the ssh stonith agent) via nc whether the ssh port is responding:

is_host_up() {
  /bin/nc -w 1 -z $1 22 >/dev/null 2>&1
  return $?
}

This sounds quite weird to me, but I can't ping, and I can't check the state of the machine otherwise. I can only force the reset and then check if the machine is up. If you have any suggestions on how to make things better, don't hesitate... Beyond all, it works. I set up a variable with the timeout to wait before checking the machine status. It can become a parameter. The new version is attached.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org

#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
# Read parameters
conf_file=/etc/hetzner.cfg
user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`
pass=`sed -n 's/^pass.*=\ *//p' /etc/hetzner.cfg`
hetzner_server="https://robot-ws.your-server.de"
wait_timeout=15

is_host_up() {
  /bin/nc -w 1 -z $1 22 >/dev/null 2>&1
  return $?
}

case $1 in
  gethosts)
    echo $hostname
    exit 0
    ;;
  on)
    # Can't really be implemented because Hetzner webservice cannot power on a system
    exit 1
    ;;
  off)
    # Can't really be implemented because Hetzner webservice cannot power on a system
    exit 1
    ;;
  reset)
    curl -s -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw >/dev/null 2>&1
    sleep $wait_timeout
    is_host_up $remote_ip
    exit $((! $?))
    ;;
  status)
    is_host_up $remote_ip
    exit $?
    ;;
  getconfignames)
    echo hostname
    exit 0
    ;;
  getinfo-devid)
    echo "Hetzner STONITH device"
    exit 0
    ;;
  getinfo-devname)
    echo "Hetzner STONITH external device"
    exit 0
    ;;
  getinfo-devdescr)
    echo "Hetzner host reset"
    echo "Manages the remote webservice for reset a remote server."
    exit 0
    ;;
  getinfo-devurl)
    echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
    exit 0
    ;;
  getinfo-xml)
    cat << HETZNERXML
<parameters>
<parameter name="hostname" unique="1">
<content type="string" />
<shortdesc lang="en">Hostname</shortdesc>
<longdesc lang="en">The name of the host to be managed by this STONITH device.</longdesc>
</parameter>
<parameter name="remote_ip" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">Remote IP</shortdesc>
<longdesc lang="en">The address of the remote IP that manages this server.</longdesc>
</parameter>
</parameters>
HETZNERXML
    exit 0
    ;;
  *)
    exit 1
    ;;
esac

___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
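For what it's worth, a minimal sketch of the negation Lars is hinting at: in POSIX sh the ! keyword inverts the exit status of the pipeline that follows it, and $? right after it carries the inverted value, so the arithmetic trick is not needed (is_host_up and remote_ip as defined in the script above):

# 0 (host up) becomes 1, non-zero (host down) becomes 0
! is_host_up "$remote_ip"
exit $?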
Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent
Il giorno Gio 26 Mag 2011 11:13:46 CET, RaSca ha scritto: [...] The new version is attached.

Hi all, after talking with Dejan on IRC, here is the new version of the agent. Major changes:

- The script no longer relies on SSH to check that the device was correctly fenced; instead it checks the HTTP response code from the webservice;
- The status action looks for a 200 response from the webservice in GET mode;
- In case of problems, the return code of the RA is 1 and I also added a description of the problem (check_http_response function).

That's all. Last but not least, it works. To make things perfect I just ask Lars (following the discussion of two days ago) if there's a way to compact this:

check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw)
exit $?

to a one-line statement. Thanks everybody for the help,

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org

#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
# Read parameters from config file, format is based upon the hetzner OCF resource agent
# developed by Kumina: http://blog.kumina.nl/2011/02/hetzner-failover-ip-ocf-script/
conf_file=/etc/hetzner.cfg
user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`
pass=`sed -n 's/^pass.*=\ *//p' /etc/hetzner.cfg`
hetzner_server="https://robot-ws.your-server.de"

check_http_response() {
  # If the response is 200 then return 0
  if [ "$1" = 200 ]
  then
    return 0
  else
    # If the response is not 200 then display a description of the problem and return 1
    case $1 in
      400) echo "INVALID_INPUT - Invalid input parameters" ;;
      404) echo "SERVER_NOT_FOUND - Server with ip $remote_ip not found" ;;
      409) echo "RESET_MANUAL_ACTIVE - There is already a running manual reset" ;;
      500) echo "RESET_FAILED - Resetting failed due to an internal error" ;;
    esac
    return 1
  fi
}

case $1 in
  gethosts)
    echo $hostname
    exit 0
    ;;
  on)
    # Can't really be implemented because Hetzner webservice cannot power on a system
    exit 1
    ;;
  off)
    # Can't really be implemented because Hetzner webservice cannot power on a system
    exit 1
    ;;
  reset)
    # Launching the reset action via webservice
    check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw)
    exit $?
    ;;
  status)
    # Check if we can contact the webservice
    check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/server/$remote_ip)
    exit $?
    ;;
  getconfignames)
    echo hostname
    echo remote_ip
    exit 0
    ;;
  getinfo-devid)
    echo "Hetzner STONITH device"
    exit 0
    ;;
  getinfo-devname)
    echo "Hetzner STONITH external device"
    exit 0
    ;;
  getinfo-devdescr)
    echo "Hetzner host reset"
    echo "Manages the remote webservice for reset a remote server."
    exit 0
    ;;
  getinfo-devurl)
    echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
    exit 0
    ;;
  getinfo-xml)
    cat << HETZNERXML
<parameters>
<parameter name="hostname" unique="1">
<content type="string" />
<shortdesc lang="en">Hostname</shortdesc>
<longdesc lang="en">The name of the host to be managed by this STONITH device.</longdesc>
</parameter>
<parameter name="remote_ip" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">Remote IP</shortdesc>
<longdesc lang="en">The address of the remote IP that manages this server.</longdesc>
</parameter>
</parameters>
HETZNERXML
    exit 0
    ;;
  *)
    exit 1
    ;;
esac
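One hedged way to shorten it: exit only accepts a numeric status, not a command, so the closest one-line form is joining the two statements with a semicolon (same variables as in the script above):

# the command substitution feeds the HTTP code to the checker;
# $? immediately afterwards is the checker's verdict
check_http_response "$(curl --silent -o /dev/null -w '%{http_code}' -u "$user:$pass" "$hetzner_server/reset/$remote_ip" -d type=hw)"; exit $?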
Re: [Linux-HA] Colocation of VIP and httpd
Il giorno Gio 19 Mag 2011 19:25:54 CET, 吴鸿宇 ha scritto: Hi All, I have a 2 node cluster. My intention is to ensure the VIP is always on the node that has httpd running, i.e. if httpd on the VIP node is stopped and fails to start, the VIP should switch to the other node. With the configuration below, I observed that when httpd stops and fails to start, the VIP is also stopped but is not switched to the other node that has a healthy httpd. I appreciate any ideas. [...]

Some questions: Why is httpd cloned? Are you sure you want an INFINITY stickiness? Do the logs say anything helpful? Anyway, as Nikita said, consider upgrading Heartbeat to version 3.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Hetzner server stonith agent
Hi all, as some of you saw in the last two weeks I've faced some problems in configuring a Corosync/Pacemaker cluster on two Hetzner servers. The main problem with those cheap and very powerful servers is their network management. For example, if you want a failover IP you have to manage it through the web interface or via a webservice; there's no other way. Luckily, the guys from Kumina (http://blog.kumina.nl/2011/02/hetzner-failover-ip-ocf-script/) wrote an OCF resource agent that automates the management of the IP, so the last (but not least) problem was the STONITH. In Hetzner's intention, the only way you have to force a reset of the machine is... via the same webservice. I know, it's odd, but also in this case it is the only way. So, following these directives: http://wiki.hetzner.de/index.php/Robot_Webservice_en I wrote the stonith agent attached to this email. It is based upon the same configuration file as Kumina's OCF agent:

# cat /etc/hetzner.cfg
[dummy]
user = username
pass = password
local_ip = local ip address of the server

And it needs two parameters: the hostname and its related remote_ip, for example:

primitive stonith_hserver-1 stonith:external/hetzner \
        params hostname=hserver-1 remote_ip=X.Y.Z.G \
        op start interval=0 timeout=60s

First of all, it works. The system is able to fence nodes in case of split brain and manually, so I can say it is OK. But it is the first stonith agent I have written, so it may need some corrections. Hope this can help someone. Thanks to andreask, who helped me on IRC to understand how stonith agents work.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org

#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
# Read parameters
conf_file=/etc/hetzner.cfg
user=`cat /etc/hetzner.cfg | egrep "^user.*=" | sed 's/^user.*=\ *//g'`
pass=`cat /etc/hetzner.cfg | egrep "^pass.*=" | sed 's/^pass.*=\ *//g'`
hetzner_server="https://robot-ws.your-server.de"

is_host_up() {
  if [ "$1" != "" ]
  then
    status=`curl -s -u $user:$pass $hetzner_server/server/$1 | sed 's/.*status\:\([A-Za-z]*\),.*/\1/g'`
    if [ "$status" = "ready" ]
    then
      return 0
    else
      return 1
    fi
  else
    return 1
  fi
}

case $1 in
  gethosts)
    echo $hostname
    exit 0
    ;;
  on)
    # Can't really be implemented because Hetzner webservice cannot power on a system
    exit 1
    ;;
  off)
    # Can't really be implemented because Hetzner webservice cannot power on a system
    exit 1
    ;;
  reset)
    status=`curl -s -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw`
    if [ "$status" = "" ]
    then
      exit 1
    else
      if is_host_up $hostaddress
      then
        exit 1
      else
        exit 0
      fi
    fi
    exit 1
    ;;
  status)
    if [ "$remote_ip" != "" ]
    then
      if is_host_up $remote_ip
      then
        exit 0
      else
        exit 1
      fi
    else
      # Check if we can contact the server
      status=`curl -s -u $user:$pass $hetzner_server/server/`
      if [ "$status" = "" ]
      then
        exit 1
      else
        exit 0
      fi
    fi
    ;;
  getconfignames)
    echo hostname
    exit 0
    ;;
  getinfo-devid)
    echo "Hetzner STONITH device"
    exit 0
    ;;
  getinfo-devname)
    echo "Hetzner STONITH external device"
    exit 0
    ;;
  getinfo-devdescr)
    echo "Hetzner host reset"
    echo "Manages the remote webservice for reset a remote server."
    exit 0
    ;;
  getinfo-devurl)
    echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
    exit 0
    ;;
  getinfo-xml)
    cat
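Before wiring such an agent into the cluster, it can be exercised by hand with cluster-glue's stonith command line tool; a sketch, assuming the script is installed as external/hetzner and using the illustrative parameter values from above:

# query the device status through the plugin
stonith -t external/hetzner hostname=hserver-1 remote_ip=X.Y.Z.G -S
# ask the plugin to reset the host
stonith -t external/hetzner hostname=hserver-1 remote_ip=X.Y.Z.G -T reset hserver-1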
Re: [Linux-HA] Hetzner server stonith agent
Il giorno Mar 24 Mag 2011 12:27:04 CET, Dejan Muhamedagic ha scritto: Hi,

Hi Dejan,

[...] # Read parameters

conf_file=/etc/hetzner.cfg
user=`cat /etc/hetzner.cfg | egrep "^user.*=" | sed 's/^user.*=\ *//g'`

Better:

user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`

Absolutely agree.

pass=`cat /etc/hetzner.cfg | egrep "^pass.*=" | sed 's/^pass.*=\ *//g'`
hetzner_server="https://robot-ws.your-server.de"

I assume that this is a well-known URL which doesn't need to be passed as a parameter.

As far as I know it is the only address; I hard-coded it for that reason, but maybe it should be a parameter...

is_host_up() { if [ "$1" != "" ] then status=`curl -s -u $user:$pass $hetzner_server/server/$1 | sed 's/.*status\:\([A-Za-z]*\),.*/\1/g'` if [ "$status" = "ready" ] then return 0 else return 1 fi

This if statement can be reduced to (you save 5 lines): [ "$status" = ready ]

else return 1 fi }

You mean the statement should be:

[ "$status" = ready ]
return 0
return 1

?

[...] Again, better (is the return code of is_host_up inverted?):

is_host_up $hostaddress
exit  # this is actually also superfluous, but perhaps better left in

The action is reset, so if I had success then is_host_up must be NOT ready. Or not?

[...] Ditto. Good work! Cheers, Dejan P.S. Moving discussion to linux-ha-dev.

If the compact way is correct, I can modify the script and post it again.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
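If I read Dejan's hint correctly, the function can collapse even further, because a shell function returns the exit status of its last command; a hedged sketch of is_host_up along those lines (same variables as in the script):

is_host_up() {
  status=`curl -s -u $user:$pass $hetzner_server/server/$1 | sed 's/.*status:\([A-Za-z]*\),.*/\1/'`
  # the test's status (0 if ready, 1 otherwise) becomes the function's return code
  [ "$status" = ready ]
}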
Re: [Linux-HA] Hetzner server stonith agent
Il giorno Mar 24 Mag 2011 12:44:42 CET, RaSca ha scritto: Il giorno Mar 24 Mag 2011 12:27:04 CET, Dejan Muhamedagic ha scritto: [...] P.S. Moving discussion to linux-ha-dev. [...] Sorry... I removed the wrong address and posted again on linux-ha :-( -- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Best way for colocating resource on a dual primary drbd
Il giorno Lun 16 Mag 2011 09:01:08 CET, Andrew Beekhof ha scritto: [...] Does it imply that once the resource goes away it becomes slave? Pretty sure this is a bug in 1.0. Have you tried 1.1.5?

Not yet. But then, Andrew, are you saying that keeping the colocation even with a dual-primary DRBD is the best thing to do?

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Event: an Heartbeat/Corosync/DRBD/Pacemaker free seminar in Rho (Milan, Italy), on June 24 2011
Hi all, I hope this message is not too off-topic, but I want to present a seminar that will take place in Rho (MI), Italy, on June 24 2011. The title is "Evoluzione dell'alta affidabilità su Linux"; it will be a one-day seminar focused on the evolution of Linux clustering, from Heartbeat to Pacemaker, passing through DRBD and Corosync. There will also be a lab part with the creation of an active-active NFS server based on LVM and DRBD. The whole project is made by MMUL (http://www.mmul.it) in collaboration with Linbit (http://www.linbit.com, thanks to Florian). I know that most of you are not Italian, but it may be a good reason to come and see the Bel Paese. Here is all the information about the event: http://www.miamammausalinux.org/2011/05/i-seminari-di-mia-mamma-usa-linux-evoluzione-dellalta-affidabilita-su-linux/ It will be totally free, with a brief lunch included, everything offered by MMUL. Let me say that without your every-day help, this event would never have been possible. Have a nice day,

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Best way for colocating resource on a dual primary drbd
Il giorno Ven 13 Mag 2011 16:09:14 CET, Viacheslav Biriukov ha scritto: In your case you have two drbd masters. So, I think, it is not a good idea to create that colocation. Instead of this you can set location directives to place vm-test_virtualdomain where you want it to be by default. For example:

location L_vm-test_virtualdomain_01 vm-test_virtualdomain 100: master1.node
location L_vm-test_virtualdomain_02 vm-test_virtualdomain 10: master2.node

I agree with your point of view (since I tested that the colocation is not working). But the point is: why? I mean, the colocation defines that the virtual machine must run on a node where drbd is Master. Why does Pacemaker demote drbd to slave on the node where the migration starts? Does a colocation like this:

colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: vm-test_virtualdomain vm-test_ms-r0:Master

imply that once the resource goes away, drbd becomes slave?

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Best way for colocating resource on a dual primary drbd
Hi all, I've got a setup with a dual-primary DRBD and, on top of it, a KVM virtual machine managed by a VirtualDomain resource. In a classical primary-secondary setup, the declaration of the resources is:

primitive vm-test_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=20s timeout=40s \
        op start interval=0 timeout=240s \
        op stop interval=0 timeout=100s
ms vm-test_ms-r0 vm-test_r0 \
        meta master-max=1 notify=true
primitive vm-test_virtualdomain ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/vm-test.xml hypervisor=qemu:///system \
        meta allow-migrate=false \
        op monitor interval=10 timeout=30 depth=0 \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s
colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: vm-test_virtualdomain vm-test_ms-r0:Master
order vm-test_virtualdomain_AFTER_vm-test_ms-r0 inf: vm-test_ms-r0:promote vm-test_virtualdomain:start

And it's perfectly clear that with this setup I cannot have live migration. So, to change this, I made these modifications, following what's written in the dual-primary DRBD documentation on clusterlabs:

primitive vm-test_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=20 role=Master timeout=20 \
        op monitor interval=30 role=Slave timeout=20 \
        op start interval=0 timeout=240s \
        op stop interval=0 timeout=100s
ms vm-test_ms-r0 vm-test_r0 \
        meta notify=true master-max=2 interleave=true
primitive vm-test_virtualdomain ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/vm-test.xml hypervisor=qemu:///system migration_transport=ssh \
        meta allow-migrate=true \
        op monitor interval=10 timeout=30 depth=0 \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s
order vm-test_virtualdomain_AFTER_vm-test_ms-r0 inf: vm-test_ms-r0:promote vm-test_virtualdomain:start

And here are my doubts: without a colocation the VM migrates fine, but with a classical colocation like this:

colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: vm-test_virtualdomain vm-test_ms-r0:Master

things break, because for some reason (that I don't understand) the destination drbd is demoted to secondary. So, is it correct not to declare a colocation, or is there a better way to do what I'm doing? Thanks a lot!

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pingd does not react as expected = split brain
Il giorno Mer 27 Apr 2011 11:11:44 CET, Stallmann, Andreas ha scritto: Hi! I've two cluster nodes, both running pingd (as a clone), to keep resources from starting on nodes which have no obvious connection to the network. The ping nodes are: [...] Any ideas? Thanks for your help,

As far as I remember the main suggestion was to use ping instead of pingd, so... try ping.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pingd does not react as expected = split brain
Il giorno Mer 27 Apr 2011 12:04:57 CET, Stallmann, Andreas ha scritto: Hi! -Original Message- From: RaSca [mailto:ra...@miamammausalinux.org] Sent: Wednesday, 27 April 2011 11:28 As far as I remember the main suggestion was to use ping instead of pingd, so... try ping. Already using ping. See for yourself: primitive pingy_res ocf:pacemaker:ping ... Any other suggestions? TNX, A.

Sorry, I didn't look at the configuration. My suggestion (as Lars already said) is to use the fencing options of DRBD. I've got a setup like yours, with two machines in two different places, and after split-brain situations I've never had corrupted data. I'm fencing nodes via heartbeat: http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html Good luck!

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Problem with an active/active NFS setup with exportfs RA
Il giorno Ven 15 Apr 2011 19:33:18 CET, Alessandro Iurlano ha scritto: For the record, thanks to the guys in @linux-ha I now have a working configuration. The missing point was that OCFS2 and DLM were blocked, waiting for a stonith call for the other node to end. In my configuration I had stonith disabled, but this seems not to affect OCFS2 and DLM. So the solution was to enable stonith with a plugin (I tried both the meatware and the null plugin for testing) and now the cluster seems to behave correctly (as of the first tests). Thanks! Alessandro

Great, Alessandro! I'm sorry for being so late in answering the thread, I've been a little busy. I'm happy to read that @linux-ha is still the best place for making doubts fly away. Have a nice day!

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Problem with an active/active NFS setup with exportfs RA
Il giorno Sab 02 Apr 2011 19:04:08 CET, Alessandro Iurlano ha scritto: On Fri, Apr 1, 2011 at 11:34 AM, RaSca ra...@miamammausalinux.org wrote: Then I tried to find a way to keep just the rmtab file synchronized on both nodes. I cannot find a way to have pacemaker do this for me. Is there one? As far as I know, all those operations are handled by the exportfs RA. I believe this was true until the backup part was removed. See the git commit below.

So, for some reason this is not needed anymore, but I don't think it will create problems; surely the RA maintainer has done all the necessary tests.

I checked the boot order and indeed I was doing it the wrong way. After I fixed it, a couple of tests worked right away, while the client hung again when I switched the cluster back to both nodes online. Could you post your working configuration? Thanks, Alessandro

Here it is. Note that I'm using DRBD instead of a shared storage (basically each drbd is a stand-alone export that can reside independently on a node):

node ubuntu-nodo1
node ubuntu-nodo2
primitive drbd0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=20s timeout=40s
primitive drbd1 ocf:linbit:drbd \
        params drbd_resource=r1 \
        op monitor interval=20s timeout=40s
primitive nfs-kernel-server lsb:nfs-kernel-server \
        op monitor interval=10s timeout=30s
primitive ping ocf:pacemaker:ping \
        params host_list=172.16.0.1 multiplier=100 name=ping \
        op monitor interval=20s timeout=60s \
        op start interval=0 timeout=60s
primitive portmap lsb:portmap \
        op monitor interval=10s timeout=30s
primitive share-a-exportfs ocf:heartbeat:exportfs \
        params directory=/share-a clientspec=172.16.0.0/24 options=rw,async,no_subtree_check,no_root_squash fsid=1 \
        op monitor interval=10s timeout=30s \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=40s
primitive share-a-fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/share-a fstype=ext3 options=noatime fast_stop=no \
        op monitor interval=20s timeout=40s \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
primitive share-a-ip ocf:heartbeat:IPaddr2 \
        params ip=172.16.0.63 nic=eth0 \
        op monitor interval=20s timeout=40s
primitive share-b-exportfs ocf:heartbeat:exportfs \
        params directory=/share-b clientspec=172.16.0.0/24 options=rw,no_root_squash fsid=2 \
        op monitor interval=10s timeout=30s \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=40s
primitive share-b-fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd1 directory=/share-b fstype=ext3 options=noatime fast_stop=no \
        op monitor interval=20s timeout=40s \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s
primitive share-b-ip ocf:heartbeat:IPaddr2 \
        params ip=172.16.0.64 nic=eth0 \
        op monitor interval=20s timeout=40s
primitive statd lsb:statd \
        op monitor interval=10s timeout=30s
group nfs portmap statd nfs-kernel-server
group share-a share-a-fs share-a-exportfs share-a-ip
group share-b share-b-fs share-b-exportfs share-b-ip
ms ms_drbd0 drbd0 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_drbd1 drbd1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
clone nfs_clone nfs \
        meta globally-unique=false
clone ping_clone ping \
        meta globally-unique=false
location share-a_on_connected_node share-a \
        rule $id=share-a_on_connected_node-rule -inf: not_defined ping or ping lte 0
location share-b_on_connected_node share-b \
        rule $id=share-b_on_connected_node-rule -inf: not_defined ping or ping lte 0
colocation share-a_on_ms_drbd0 inf: share-a ms_drbd0:Master
colocation share-b_on_ms_drbd1 inf: share-b ms_drbd1:Master
order share-a_after_ms_drbd0 inf: ms_drbd0:promote share-a:start
order share-b_after_ms_drbd1 inf: ms_drbd1:promote share-b:start
property $id=cib-bootstrap-options \
        dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        last-lrm-refresh=1301915944

Note that I've grouped all the nfs-server daemons (portmap, nfs-common and nfs-kernel-server) in the cloned group nfs_clone.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
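A quick client-side check of a setup like this, a sketch assuming the virtual IPs above and a client on the 172.16.0.0/24 network:

# each virtual IP should answer with its own export
showmount -e 172.16.0.63
showmount -e 172.16.0.64
# then a trial mount of one share
mount -t nfs 172.16.0.63:/share-a /mnt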
Re: [Linux-HA] Problem with an active/active NFS setup with exportfs RA
Il giorno Gio 31 Mar 2011 18:42:17 CET, Alessandro Iurlano ha scritto: Hello. [...] I have tried to put the /var/lib/nfs directory on an OCFS2 filesystem shared by both nodes, but I had a lot of stability problems with the nfs server processes. In particular they often seem to hang while starting or even stopping. I think this could be because they may be locking some files on the shared filesystem. As the files are kept locked by the daemons, further lock operations may be blocked indefinitely.

Having a shared /var/lib/nfs makes sense with active-standby configurations; the nfs servers do not talk to each other, so with two active servers I think it's normal to have instability.

Then I tried to find a way to keep just the rmtab file synchronized on both nodes. I cannot find a way to have pacemaker do this for me. Is there one?

As far as I know, all those operations are handled by the exportfs RA.

Also, I have found that the exportfs RA originally had a mechanism to keep rmtab synchronized, but it has been removed in this commit: https://github.com/ClusterLabs/resource-agents/commit/0edb009a87f0d47b310998f2cb3809d2775e2de8 Is there another way to accomplish this active/active setup?

I've configured a lot of A/A setups with exportfs without having problems. What's the start order of your resources? Are you sure you're removing the IP first of all, and then exportfs? I'm asking because one of my problems was with connections kept open by the clients (I was removing exportfs first).

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Sort of crm commandes but off line ?
Il giorno Gio 24 Mar 2011 14:32:09 CET, Alain.Moulle ha scritto: Hi, Ok, I think my question was not clear: in fact, the problem is not whether to do "ssh node crm ..." or not, the problem is just to know the hostname of the node to ssh to, in another way than parsing the cib.xml, so as to know which other nodes are in the same HA cluster as the node where I am (knowing that corosync is stopped on this local node). Thanks. Regards. Alain

This might sound obvious, but is an ssh call acceptable? What about:

cat /var/lib/heartbeat/crm/cib.xml | grep "node id" | sed -n 's/.*uname="\([^"]*\)".*/\1/p'

?

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Sort of crm commandes but off line ?
Il giorno Gio 24 Mar 2011 15:11:57 CET, Alain.Moulle ha scritto: Thanks, but that's a search in cib.xml... I already have a solution with xpath for that:

xpath /var/lib/heartbeat/crm/cib.xml '/cib/configuration/nodes/node/@uname' 2>/dev/null

etc. My question was slightly different... but anyway, I'll parse cib.xml. Thanks a lot. Regards Alain

Sorry, I don't think I really understood what you want to obtain, but if a node is not connected to the cluster, then I don't think it's possible to look for this kind of information anywhere else than by checking the XML. What about creating a shared area (external to the cluster, like an nfs mount) in which the cluster (with the crm_mon resource agent) puts the cluster state (in html, for example)? If a node is not connected to the cluster, it should simply look in this area, parse the html (or txt, or whatever) and then understand who is in the cluster... That would be real-time information, not a (possibly) old xml file... Hope this helps,

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failover NFS using Pacemaker.
Il giorno Mar 22 Mar 2011 10:32:10 CET, Christoph Bartoschek ha scritto: [...] The linbit.com document "Highly available NFS storage with DRBD and Pacemaker" suggests to use the lsb:nfs-kernel-server resource while the wiki suggests ocf:heartbeat:nfsserver. Which one is the better one and what are the advantages? Christoph

exportfs must not be used together with nfsserver: they do the same thing in two different ways. The first manages the exports on an already active nfs server; the second manages the nfs process (and there you must have a shared /var/lib/nfs between the nodes). So the choice is yours. I personally prefer exportfs because it is a little bit simpler: all you have to do is configure the primitive. There is also one more big thing that exportfs has and nfsserver does not: suppose you want to create a clustered nfs server in which every node shares an export. This is very simple with exportfs, but basically impossible with nfsserver (unless you split your configuration, your nfs daemon and so on). Have a nice day.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
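To give an idea, a minimal sketch of such a primitive (directory, network and fsid are illustrative):

primitive www-exportfs ocf:heartbeat:exportfs \
        params directory=/www clientspec=192.168.1.0/24 options=rw,no_root_squash fsid=1 \
        op monitor interval=10s timeout=30s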
Re: [Linux-HA] Failover NFS using Pacemaker.
Il giorno Mar 22 Mar 2011 11:01:25 CET, Caspar Smit ha scritto: I basically had the same question as Christoph :) Thanks for the great clarification RaSca! I'll go for exportfs and lsb:nfs-kernel-server for sure. Kind regards, Caspar Smit

One is glad to be of service :-) In my experience, I always create an nfs group composed of portmap, nfs-common and nfs-kernel-server (all lsb resources) and cloned on every node. Then I create the sub-groups for the single exports, each composed of the share (local, iscsi or whatever), exportfs and a virtual IP.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failover NFS using Pacemaker.
Il giorno Mar 22 Mar 2011 12:02:18 CET, Caspar Smit ha scritto: [...] Thanks for this tip. Why would the linbit.com document not mention nfs-common and portmap? Are these only needed in specific situations or are they always needed (I'm using debian lenny)?

I think this is because those are considered system services, something that is always active (if you want to mount nfs resources you need both portmap and nfs-common).

And do the cloned resources (portmap, nfs-common and nfs-kernel-server) really have to be in a group? (Since they are cloned and the service runs on all nodes anyway.) Kind regards, Caspar

I made groups just to preserve a logical order: first portmap, then nfs-common and then nfs-kernel-server.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failover NFS using Pacemaker.
Il giorno Mar 22 Mar 2011 13:50:19 CET, Caspar Smit ha scritto: Could you maybe post (a snippet of) your crm configuration (the part concerning NFS)? This could be a great help for other users as well, I think. Thanks in advance. Kind regards, Caspar Smit

Hi Caspar, all my experience is inside the article I wrote for my technical portal miamammausalinux.org: http://www.miamammausalinux.org/2010/09/evoluzione-dellalta-affidabilita-su-linux-realizzare-un-nas-con-pacemaker-drbd-ed-exportfs/ There you will find everything you need; the only problem is that it is all written in Italian... Of course, translators are welcome ;-)

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failover NFS using Pacemaker.
Il giorno Mar 22 Mar 2011 15:11:34 CET, Caspar Smit ha scritto: Thanks for this, but I don't see any portmap, nfs-common and/or nfs-kernel-server lsb scripts used in that article. That was just the part I was interested in :) Kind regards Caspar Smit

You're absolutely right :-) Take a look at this configuration:

node ubuntu-nodo1
node ubuntu-nodo2
primitive drbd0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=20s timeout=40s \
        meta target-role=started
primitive drbd1 ocf:linbit:drbd \
        params drbd_resource=r1 \
        op monitor interval=20s timeout=40s
primitive nfs-kernel-server lsb:nfs-kernel-server \
        op monitor interval=10s timeout=30s
primitive ping ocf:pacemaker:ping \
        params host_list=172.16.0.1 multiplier=100 name=ping \
        op monitor interval=20s timeout=60s \
        op start interval=0 timeout=60s
primitive portmap lsb:portmap \
        op monitor interval=10s timeout=30s
primitive share-a-exportfs ocf:heartbeat:exportfs \
        params directory=/share-a clientspec=172.16.0.0/24 options=rw,async,no_subtree_check,no_root_squash fsid=1 \
        op monitor interval=10s timeout=30s \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=40s \
        meta is-managed=true target-role=started
primitive share-a-fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd0 directory=/share-a fstype=ext3 options=noatime fast_stop=no \
        op monitor interval=20s timeout=40s \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s \
        meta is-managed=true target-role=started
primitive share-a-ip ocf:heartbeat:IPaddr2 \
        params ip=172.16.0.63 nic=eth0 \
        op monitor interval=20s timeout=40s \
        meta is-managed=true target-role=started
primitive share-b-exportfs ocf:heartbeat:exportfs \
        params directory=/share-b clientspec=172.16.0.0/24 options=rw,no_root_squash fsid=2 \
        op monitor interval=10s timeout=30s \
        op start interval=0 timeout=40s \
        op stop interval=0 timeout=40s \
        meta target-role=started
primitive share-b-fs ocf:heartbeat:Filesystem \
        params device=/dev/drbd1 directory=/share-b fstype=ext3 options=noatime fast_stop=no \
        op monitor interval=20s timeout=40s \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s \
        meta target-role=started
primitive share-b-ip ocf:heartbeat:IPaddr2 \
        params ip=172.16.0.64 nic=eth0 \
        op monitor interval=20s timeout=40s \
        meta target-role=started
primitive statd lsb:statd \
        op monitor interval=10s timeout=30s
group nfs portmap statd nfs-kernel-server
group share-a share-a-fs share-a-exportfs share-a-ip
group share-b share-b-fs share-b-exportfs share-b-ip
ms ms_drbd0 drbd0 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_drbd1 drbd1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
clone nfs_clone nfs \
        meta globally-unique=false
clone ping_clone ping \
        meta globally-unique=false
location share-a_on_connected_node share-a \
        rule $id=share-a_on_connected_node-rule -inf: not_defined ping or ping lte 0
location share-b_on_connected_node share-b \
        rule $id=share-b_on_connected_node-rule -inf: not_defined ping or ping lte 0
colocation share-a_on_ms_drbd0 inf: share-a ms_drbd0:Master
colocation share-b_on_ms_drbd1 inf: share-b ms_drbd1:Master
order share-a_after_ms_drbd0 inf: ms_drbd0:promote share-a:start
order share-b_after_ms_drbd1 inf: ms_drbd1:promote share-b:start
property $id=cib-bootstrap-options \
        dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
        cluster-infrastructure=openais \
        expected-quorum-votes=2 \
        no-quorum-policy=ignore \
        stonith-enabled=false \
        last-lrm-refresh=1300276063

I omitted the nfs server part in my article because explaining that part as well would have made the article even longer... Bye,

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Failover NFS using Pacemaker.
Il giorno Lun 21 Mar 2011 16:27:35 CET, Caspar Smit ha scritto: Hi, [...] In this document there is no mention of the /var/lib/nfs directory but instead a new resource agent (exportfs). Does this exportfs resource agent deprecate the need for a shared /var/lib/nfs or do I still need to do that?

Hi, no: exportfs automatically creates just two hidden files, in which the exportfs process pid and the rmtab information are stored.

ps. What about the nfsserver resource agent? Will I need that too?

You will need to have an NFS server running. The exportfs RA populates the system exports table. It's your choice whether to put the nfs server under the control of the cluster (maybe with a cloned resource group composed of portmap, nfs-common and nfs-kernel-server) or not. Bye,

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration
Il giorno Ven 11 Mar 2011 07:32:32 CET, Randy Katz ha scritto: ps - in /var/log/messages I find this:

Mar 10 22:31:45 drbd1 lrmd: [3274]: ERROR: get_resource_meta: pclose failed: Interrupted system call
Mar 10 22:31:45 drbd1 lrmd: [3274]: WARN: on_msg_get_metadata: empty metadata for ocf::linbit::drbd.
Mar 10 22:31:45 drbd1 lrmadmin: [3481]: ERROR: lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply message of rmetadata with function get_ret_from_msg.

[...]

Hi, I think the "no such resource agent" message explains what's the matter. Does the file /usr/lib/ocf/resource.d/linbit/drbd exist? Is the drbd file executable? Have you correctly installed the drbd packages? Check those things; you can also try to reinstall drbd.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration
Il giorno Ven 11 Mar 2011 10:36:25 CET, Randy Katz ha scritto: [...] Hi,

# ls -l /usr/lib/ocf/resource.d/linbit/drbd
-rwxr-xr-x 1 root root 24523 Jun 4 2010 /usr/lib/ocf/resource.d/linbit/drbd

DRBD is running fine, I have set up that part of it already. I am using the ha-scsi.pdf and up to this point everything is fine. Randy

This is a little bit strange. If you are sure about the drbd setup, then the "no such resource agent" error should not be present. What is your pacemaker version? It might be a bug (on google there are some cases of this kind of problem that are bugs). You can even try to run the resource agent manually, by going into /usr/lib/ocf/resource.d/linbit/ and setting the environment variables needed, and see what happens.

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
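For the manual run, a sketch of two approaches; ocf-tester ships with the resource-agents package (the resource name and the drbd_resource value are illustrative):

# drive the agent through a full test cycle
ocf-tester -n test-drbd -o drbd_resource=r0 /usr/lib/ocf/resource.d/linbit/drbd
# or invoke a single action by hand with the environment the cluster would set
OCF_ROOT=/usr/lib/ocf OCF_RESKEY_drbd_resource=r0 /usr/lib/ocf/resource.d/linbit/drbd monitor; echo $?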
Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration
Il giorno Ven 11 Mar 2011 11:15:03 CET, RaSca ha scritto: [...] This is a little bit strange. If you are sure about the drbd setup, then the "no such resource agent" error should not be present. What is your pacemaker version? It might be a bug (on google there are some cases of this kind of problem that are bugs). You can even try to run the resource agent manually, by going into /usr/lib/ocf/resource.d/linbit/ and setting the environment variables needed, and see what happens.

Another thing... Have you tried declaring also the master-slave resource for that drbd and seeing what happens?

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Here again with my problem with iscsi resource agent
Il giorno Mar 18 Gen 2011 12:06:36 CET, Dejan Muhamedagic ha scritto: [...] Good. Thanks for investigating. As we already discussed yesterday on irc, the iscsi patch I attached last week contained a bug which has now been fixed in the repository. Perhaps that bug caused the further problems you experienced, so it may not actually have anything to do with iscsiadm discovery. Can you please test that. [...]

Finally I've tested it. It works. So you can now choose whether or not to include the discovery selection patch I made. But at this point I can confirm that the latest iscsi resource agent works great. Have a great day!

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org http://www.miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Problem with dependent groups at cluster startup
Il giorno Mer 19 Gen 2011 15:55:30 CET, Andrew Beekhof ha scritto: [...] Not sure where you are with this, but these logs indicate that some things are already running (rc=0) when the cluster starts:

Jan 11 15:11:52 SE4 crmd: [32095]: info: match_graph_event: Action db_iscsi-lun_monitor_0 (7) confirmed on se4 (rc=0)
Jan 11 15:11:52 SE4 crmd: [32095]: info: match_graph_event: Action db_iscsi-target_monitor_0 (6) confirmed on se4 (rc=0)

This may create the impression that pacemaker started things in the wrong order, even though it didn't.

As discussed with Dejan, we found a solution. There is something in the discovery that fails, but now the RA is patched. Look at the thread "Here again with my problem with iscsi resource agent". Thanks a lot for your help!

-- RaSca Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Is 'resource_set' still experimental?
Il giorno Mar 18 Gen 2011 11:20:13 CET, Florian Haas ha scritto: On 01/04/2011 11:56 AM, Tobias Appel wrote: On 12/28/2010 06:46 PM, Dejan Muhamedagic wrote: 40 order constraints? A big cluster. We currently have 40 VMs (XEN) on it. I can't put them in a group since they have to run independently and not necessarily on the same node(s). And setting meta ordered=false colocated=false on the group is not an option? Florian

As discussed yesterday on IRC with Andrew, there is no way of creating a group with independent resources. I was hoping that setting the options you mentioned could do the trick, but I've just tested it. If you declare a group like this:

group groupA resA resB resC meta ordered=false colocated=false

and then you do a crm resource stop resB, then resC is also stopped. So the only way to make this setup work is to declare an order+colocation set with parentheses. Please correct me if I'm wrong or if I've misunderstood what you wrote.

-- Raoul Scarazzini Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
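For the record, a hedged sketch of the set-based syntax in the crm shell, where parentheses mark a non-sequential set whose members are handled independently (resource names are illustrative):

# the VMs start after the storage clone, but independently of each other
order o_vms inf: storage-clone ( vmA vmB vmC )
colocation c_vms inf: ( vmA vmB vmC ) storage-clone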
Re: [Linux-HA] heartbeat moves the resources when heartbeat starts on a second node
Il giorno Mar 18 Gen 2011 12:13:15 CET, Erik Dobák ha scritto: Hi people, I got my active/passive cluster running. When I start the first node all resources are started, but when I start the second node, all resources are stopped on the first node and started on the second node. Why? Am I doing something wrong? [...] cheers E

You have to take a look at resource stickiness and placement; it's all covered in the Configuration Explained docs. Bye,

-- Raoul Scarazzini Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
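As a starting point, a sketch of the knob involved; the value is illustrative and the exact syntax depends on your version (recent crm shell shown here):

# make resources prefer the node they are on instead of
# moving back when a node (re)joins the cluster
rsc_defaults resource-stickiness=100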
Re: [Linux-HA] no quorum problem
Il giorno Mar 18 Gen 2011 16:11:13 CET, Pavlos Polianidis ha scritto: Dear Andrew, so is there any solution to make the quorum operate? Thanks in advance, Pavlos Polianidis

How is your no-quorum-policy set?

-- Raoul Scarazzini Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene! ra...@miamammausalinux.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
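For context, a two-node cluster can never regain quorum after losing a node, so the usual (hedged) workaround for such setups is:

property no-quorum-policy=ignore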
Re: [Linux-HA] Here again with my problem with iscsi resource agent
Il giorno Ven 14 Gen 2011 17:34:10 CET, RaSca ha scritto: [...] I can say for sure that we will surely know. As you can see, in all of my posts (and projects) I never give up until there is a clear solution to the problem. I will find it out in this case too. Thanks again,

Dejan, here I am. Attached to this mail you can find a patch to your original iscsi RA, in which I've added the possibility to do or not to do the discovery (a discovery_enable parameter). With it, the RA works perfectly and I can use it in my environment. As I first supposed, there is something in the discovery code that makes things break.

Note that I will continue my investigation of the discovery problem, but in the meantime this lets me use the original RA with my patch and, of course, the discovery_enable parameter set to "no". Maybe, since this is an optional parameter, it can be included in the official RA.

Hoping that you're not unhappy anymore ;-)

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

*** heartbeat/iscsi	2011-01-16 13:26:12.000000000 +0100
--- rasca/iscsi	2011-01-16 13:27:57.000000000 +0100
***************
*** 31,36 ****
--- 31,37 ----
  # OCF_RESKEY_portal: the iSCSI portal address or host name (required)
  # OCF_RESKEY_target: the iSCSI target (required)
  # OCF_RESKEY_iscsiadm: iscsiadm program path (optional)
+ # OCF_RESKEY_discovery_enable: enable discovery? (default: yes)
  # OCF_RESKEY_discovery_type: discovery type (optional; default: sendtargets)
  #
  # Initialization:
***************
*** 87,92 ****
--- 88,101 ----
  <content type="string" default="" />
  </parameter>
  
+ <parameter name="discovery_enable" unique="0" required="0">
+ <longdesc lang="en">
+ Enable discovery? In some cases doing the discovery can break things on startup.
+ </longdesc>
+ <shortdesc lang="en">discovery_enable</shortdesc>
+ <content type="string" default="yes" />
+ </parameter>
+ 
  <parameter name="discovery_type" unique="0" required="0">
  <longdesc lang="en">
  Discovery type. Currently, with open-iscsi, only the sendtargets
***************
*** 179,187 ****
  #	3: iscsiadm returned error
  
  open_iscsi_discovery() {
! 	output=`$iscsiadm -m discovery -p $OCF_RESKEY_portal -t $discovery_type`
  	if [ $? -ne 0 -o "x" = "x$output" ]; then
! 		[ "x" != "x$output" ] &&
! 			echo "$output"
  		return 3
  	fi
  	portal=`echo "$output" |
--- 188,205 ----
  #	3: iscsiadm returned error
  
  open_iscsi_discovery() {
! 	local output
! 	local portal
! 	local severity=err
! 	local cmd="$iscsiadm -m discovery -p $OCF_RESKEY_portal -t $discovery_type"
! 
! 	ocf_is_probe && severity=info
! 	output=`$cmd`
  	if [ $? -ne 0 -o "x" = "x$output" ]; then
! 		[ "x" != "x$output" ] && {
! 			ocf_log $severity "$cmd FAILED"
! 			echo "$output"
! 		}
  		return 3
  	fi
  	portal=`echo "$output" |
***************
*** 196,202 ****
  	case `echo "$portal" | wc -w` in
  	0) #target not found
  		echo "$output"
! 		ocf_log err "target $OCF_RESKEY_target not found at portal $OCF_RESKEY_portal"
  		return 1
  		;;
  	1) #we're ok
--- 214,220 ----
  	case `echo "$portal" | wc -w` in
  	0) #target not found
  		echo "$output"
! 		ocf_log $severity "target $OCF_RESKEY_target not found at portal $OCF_RESKEY_portal"
  		return 1
  		;;
  	1) #we're ok
***************
*** 336,343 ****
--- 354,366 ----
  	exit $OCF_ERR_PERM
  fi
  
+ discovery_enable=${OCF_RESKEY_discovery_enable:-"yes"}
+ [ "$discovery_enable" != "yes" ] && portal=$OCF_RESKEY_portal
  discovery_type=${OCF_RESKEY_discovery_type:-"sendtargets"}
  udev=${OCF_RESKEY_udev:-"yes"}
+ 
+ if [ "$discovery_enable" = "yes" ]
+ then
  $discovery	# discover and setup the real portal string (address)
  case $? in
  0) ;;
***************
*** 346,353 ****
  1) [ "$1" = status ] && exit $LSB_STATUS_STOPPED
     exit $OCF_ERR_GENERIC
     ;;
! [23]) exit $OCF_ERR_GENERIC;;
  esac
  
  # which method was invoked?
  case $1 in
--- 369,380 ----
  1) [ "$1" = status ] && exit $LSB_STATUS_STOPPED
     exit $OCF_ERR_GENERIC
     ;;
! 2) exit $OCF_ERR_GENERIC;;
! 3) ocf_is_probe && exit $OCF_NOT_RUNNING
!    exit $OCF_ERR_GENERIC
!    ;;
  esac
+ fi
  
  # which method was invoked?
  case $1 in
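With the patch applied, a primitive using the new parameter would look something like this; the portal and target values are borrowed from the logs earlier in this thread, and since the discovery is skipped the portal string is taken literally from the parameter:

primitive www_db-iscsi ocf:heartbeat:iscsi \
    params portal="10.0.0.100:3260" \
        target="iqn.2010-12.local.rascanet:db.rascanet.iscsi" \
        discovery_enable="no" \
    op monitor interval="120s" timeout="30s"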
Re: [Linux-HA] Here again with my problem with iscsi resource agent
Il giorno Ven 14 Gen 2011 12:18:38 CET, Dejan Muhamedagic ha scritto: [...] iscsiadm fails with the following message:

Jan 13 19:15:46 debian-squeeze-nodo1 lrmd: [1274]: info: RA output: (www_db-iscsi:start:stderr) iscsiadm:
Jan 13 19:15:46 debian-squeeze-nodo1 lrmd: [1274]: info: RA output: (www_db-iscsi:start:stderr) no records found!

Try to figure out what that means, my iscsi is a bit rusty.

Hey Dej, I finally made things work by rewriting the resource agent. I was unable to do a proper debug, so I cleaned up (for me, of course) your code. You can find it attached to this mail. It works perfectly in my topology, but it has some limitations:

- it runs only on Linux (while your original was meant also for different situations);
- it runs only with udev (I don't make any udev check like in the original RA);
- it's obviously a rough draft.

Hope this can help improve things in some way. Let me know if I can do something else. Thanks a lot for your precious support!

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

#!/bin/sh
#
# iSCSI OCF resource agent
# Description: manage iSCSI disks (add/remove) using open-iscsi
#
# Developed by RaSca (ra...@miamammausalinux.org)
# Copyright (C) 2010 MMUL S.a.S., All Rights Reserved.
# Based upon the Resource Agent iscsi by Dejan Muhamedagic <de...@suse.de>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like. Any license provided herein, whether implied or
# otherwise, applies only to this software file. Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
# See usage() and meta_data() below for more details...
#
# OCF instance parameters:
#	OCF_RESKEY_portal: the iSCSI portal address or host name (required)
#	OCF_RESKEY_target: the iSCSI target (required)
#	OCF_RESKEY_iscsiadm: iscsiadm program path (optional)
#
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

usage() {
	methods=`iscsi_methods`
	methods=`echo $methods | tr ' ' '|'`
	cat <<-!
	usage: $0 {$methods}

	$0 manages an iSCSI target

	The 'start' operation starts (adds) the iSCSI target.
	The 'stop' operation stops (removes) the iSCSI target.
	The 'status' operation reports whether the iSCSI target is connected
	The 'monitor' operation reports whether the iSCSI target is connected
	The 'validate-all' operation reports whether the parameters are valid
	The 'methods' operation reports on the methods $0 supports

	!
}

meta_data() {
	cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="iscsi">
<version>1.0</version>

<longdesc lang="en">
OCF Resource Agent for iSCSI. Add (start) or remove (stop) iSCSI
targets.
</longdesc>
<shortdesc lang="en">Manages a local iSCSI initiator and its connections to iSCSI targets</shortdesc>

<parameters>

<parameter name="portal" unique="0" required="1">
<longdesc lang="en">
The iSCSI portal address in the form: {ip_address|hostname}[:port]
</longdesc>
<shortdesc lang="en">portal</shortdesc>
<content type="string" default="" />
</parameter>

<parameter name="target" unique="1" required="1">
<longdesc lang="en">
The iSCSI target.
</longdesc>
<shortdesc lang="en">target</shortdesc>
<content type="string" default="" />
</parameter>

<parameter name="iscsiadm" unique="0" required="0">
<longdesc lang="en">
iscsiadm program path.
</longdesc>
<shortdesc lang="en">iscsiadm</shortdesc>
<content type="string" default="" />
</parameter>

</parameters>

<actions>
<action name="start" timeout="120" />
<action name="stop" timeout="120" />
<action name="status" timeout="30" />
<action name="monitor" depth="0" timeout="30" interval="120" />
<action name="validate-all" timeout="5" />
<action name="methods" timeout="5" />
<action name="meta-data" timeout="5" />
</actions>
</resource-agent>
END
}

open_iscsi_methods() {
	cat <<-!
	start
	stop
	status
	monitor
	validate-all
	methods
	meta-data
	usage
	!
}

open_iscsi_daemon() {
	if ps -e -o cmd | grep -qs '[i]scsid'; then
		return 0
	else
		ocf_log err "iscsid not running; please start open-iscsi utilities"
		return 1
	fi
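Incidentally, an agent like this can be exercised outside the cluster before loading it into the CIB. A sketch, assuming the rewritten file has been installed as /usr/lib/ocf/resource.d/heartbeat/iscsi and reusing the portal and target values from this thread:

ocf-tester -n test-iscsi \
    -o portal="10.0.0.100:3260" \
    -o target="iqn.2010-12.local.rascanet:db.rascanet.iscsi" \
    /usr/lib/ocf/resource.d/heartbeat/iscsi

ocf-tester drives start, stop, monitor and meta-data in sequence and complains about non-conformant behaviour, which is a quick way to catch regressions in a hand-edited RA.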
Re: [Linux-HA] Here again with my problem with iscsi resource agent
Il giorno Ven 14 Gen 2011 16:07:42 CET, Dejan Muhamedagic ha scritto: [...] Unfortunately, it cannot really improve the situation in any way. Dropping half the agent without knowing where the problem is, is not helpful. At least not if you want to share your findings with the community.

I know that perfectly well, but my goal for now was to solve my problem: as I have repeated (too) many times these days, I was in trouble with a customer, and simplifying the script was the first step to understand how it was put together.

One of the possible explanations is that the target simply wasn't ready at the time it tried to start (as it had only just been made available by the previous service). So, perhaps inserting a Delay resource in between could help here. If so, then maybe the iSCSI* agents need to make sure that the service is really ready before start exits.

Consider that I have reproduced a virtual environment that can be used for doing all of the tests needed. I don't think that a Delay is the solution, but I agree with you that reviewing the discovery code issue may help us.

Looking at the diff, though it's simply impossible to figure out what changed because so many things did, it looks like you dropped the discovery code (iscsiadm -m discovery). Was that where you had problems? Well, we'll probably never know for sure. Unhappy, Dejan

I can say for sure that we will surely know. As you can see, in all of my posts (and projects) I never give up until there is a clear solution to the problem. I will find it out in this case too. Thanks again,

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
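P.S.: for the record, the Delay idea would look roughly like this in crm syntax; the resource and constraint names are invented here, and A and B stand for the dependent and the providing group:

primitive target_settle ocf:heartbeat:Delay \
    params startdelay="10" \
    op start timeout="30s"
order settle_after_B inf: B:start target_settle:start
order A_after_settle inf: target_settle:start A:start

The startdelay value is a guess: it only papers over the race, which is why a readiness check inside the agent's start action would be the cleaner fix.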
Re: [Linux-HA] Here again with my problem with iscsi resource agent
Il giorno Gio 13 Gen 2011 13:22:26 CET, Dejan Muhamedagic ha scritto: Hi,

Hi Dej, and thanks as usual for your precious support.

[...] iscsiadm fails in probe with the following messages:

Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr) iscsiadm:
Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr) cannot make connection to 10.0.0.100:3260 (113)
Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr)
Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr) iscsiadm: connection to discovery address 10.0.0.100 failed
Jan 13 12:15:22 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 (113)#012iscsiadm: connection to discovery address 10.0.0.100 failed
Jan 13 12:15:26 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 (113)#012iscsiadm: connection to discovery address 10.0.0.100 failed
Jan 13 12:15:29 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 (113)#012iscsiadm: connection to discovery address 10.0.0.100 failed
Jan 13 12:15:33 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: (www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 (113)#012iscsiadm: connection to discovery address 10.0.0.100 failed#012iscsiadm: connection login retries (reopen_max) 5 exceeded#012iscsiadm: Could not perform SendTargets discovery.

The IP (10.0.0.100) is not up yet at that point. Are you sure that your configuration is sane? It looks somewhat strange to me.

My configuration looks fine to me (!). As I said, the problem comes up only at startup and not during normal failovers. What exactly looks strange to you?

Anyway, iscsi (the RA) should be more forgiving in probes. Can you please try the attached patch?

The patch applies, but the resource fails to start up with:

Failed actions:
    www_db-iscsi_monitor_0 (node=debian-squeeze-nodo1, call=39, rc=4, status=complete): insufficient privileges

The current log is attached; even if I try a cleanup, it ends again with the insufficient privileges message.

Hmm, couldn't find that in the logs. The first action on the iscsi resource is a probe, and that fails. Thanks, Dejan

And this is the strange thing for me: why is iscsi probed if the db group is not yet active?

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

Jan 13 14:37:19 debian-squeeze-nodo1 cibadmin: [1601]: info: Invoked: cibadmin -Ql -o resources
Jan 13 14:37:19 debian-squeeze-nodo1 cibadmin: [1603]: info: Invoked: cibadmin -Ql -o resources
Jan 13 14:37:19 debian-squeeze-nodo1 crm_resource: [1605]: info: Invoked: crm_resource -C -r www -H debian-squeeze-nodo1
Jan 13 14:37:19 debian-squeeze-nodo1 crm_resource: [1605]: ERROR: unpack_rsc_op: Hard error - www_db-iscsi_monitor_0 failed with rc=4: Preventing www_db-iscsi from re-starting on debian-squeeze-nodo1
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_lrm_invoke: Removing resource www_db-iscsi from the LRM
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_lrm_invoke: Resource 'www_db-iscsi' deleted for 1605_crm_resource on debian-squeeze-nodo1
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: notify_deleted: Notifying 1605_crm_resource on debian-squeeze-nodo1 that www_db-iscsi was deleted
Jan 13 14:37:19 debian-squeeze-nodo1 cib: [30073]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='debian-squeeze-nodo1']//lrm_resource[@id='www_db-iscsi'] (origin=local/crmd/175, version=0.303.9): ok (rc=0)
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: send_direct_ack: ACK'ing resource op www_db-iscsi_delete_6 from 0:0:crm-resource-1605: lrm_invoke-lrmd-1294925839-103
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: abort_transition_graph: te_update_diff:267 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=www_db-iscsi_monitor_0, magic=0:4;8:14:7:63667b9f-9576-4a57-b9b9-1f53d01ca8b7, cib=0.303.9) : Resource op removal
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_pe_invoke: Query 178: Requesting the current CIB: S_POLICY_ENGINE
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_lrm_invoke: Removing resource www_db-fs from the LRM
Jan 13 14:37:19
Re: [Linux-HA] Here again with my problem with iscsi resource agent
Il giorno Gio 13 Gen 2011 13:57:28 CET, Dejan Muhamedagic ha scritto: Hi,

[...] The patch applies, but the resource fails to start up with:

Failed actions:
    www_db-iscsi_monitor_0 (node=debian-squeeze-nodo1, call=39, rc=4, status=complete): insufficient privileges

+ /usr/lib/ocf/resource.d//heartbeat/iscsi: Permission denied

Try chmod +x /usr/lib/ocf/resource.d//heartbeat/iscsi :)

Shame on me, I'm an idiot :) Now it seems to work. In the log I can see some messages like this:

Jan 13 15:56:09 debian-squeeze-nodo1 iscsid: connect to 10.0.0.100:3260 failed (No route to host)

until the db resource comes up; then the iscsi resource comes up correctly. But now there's another problem with the resource next to this one: the first time the filesystem comes up, it fails with this error:

Jan 13 16:13:20 debian-squeeze-nodo1 Filesystem[8207]: [8250]: INFO: Running start for /dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-part1 on /db
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: (www_db-fs:start:stderr) FATAL: Module scsi_hostadapter not found.
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770010] sd 19:0:0:1: [sdc] Unhandled error code
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770056] sd 19:0:0:1: [sdc] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770123] sd 19:0:0:1: [sdc] CDB: Read(10): 28 00 00 00 00 3e 00 00 02 00
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770361] end_request: I/O error, dev sdc, sector 62
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.771595] EXT3-fs: unable to read superblock
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: (www_db-fs:start:stderr) mount: wrong fs type, bad option, bad superblock on /dev/sdc1,#012 missing codepage or helper program, or other error
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: (www_db-fs:start:stderr)
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: (www_db-fs:start:stderr) In some cases useful info is found in syslog - try#012 dmesg | tail or so
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: (www_db-fs:start:stderr)
Jan 13 16:13:20 debian-squeeze-nodo1 Filesystem[8207]: [8266]: ERROR: Couldn't mount filesystem /dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-part1 on /db
Jan 13 16:13:20 debian-squeeze-nodo1 crmd: [7400]: info: process_lrm_event: LRM operation www_db-fs_start_0 (call=32, rc=1, cib-update=57, confirmed=true) unknown error

But even though the system says "FATAL: Module scsi_hostadapter not found.", if I do a cleanup of the resource it comes up without further problems:

Jan 13 16:20:26 debian-squeeze-nodo1 Filesystem[11389]: [11444]: INFO: Running start for /dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-part1 on /db
Jan 13 16:20:26 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: (www_db-fs:start:stderr) FATAL: Module scsi_hostadapter not found.
Jan 13 16:20:27 debian-squeeze-nodo1 kernel: [20407.470761] kjournald starting. Commit interval 5 seconds
Jan 13 16:20:27 debian-squeeze-nodo1 kernel: [20407.478976] EXT3 FS on sdc1, internal journal
Jan 13 16:20:27 debian-squeeze-nodo1 kernel: [20407.479074] EXT3-fs: mounted filesystem with ordered data mode.

So everything is ok.
The filesystem resource is declared in this way:

primitive www_db-fs ocf:heartbeat:Filesystem \
    params device="/dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-part1" \
        directory="/db" fstype="ext3" \
    op monitor interval="20s" timeout="40s" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="60s"

What could be the problem?

[...] All resources are probed at startup regardless of dependencies. It's up to the resource agents to manage such situations. Thanks, Dejan

Ok, now it's clear.

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
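P.S.: if www_db-iscsi and www_db-fs are not already members of the same group, a pair of explicit constraints along these lines (names assumed from the logs) would at least guarantee the initiator session is up before the mount is attempted:

colocation www_db-fs_with_iscsi inf: www_db-fs www_db-iscsi
order www_db-fs_after_iscsi inf: www_db-iscsi:start www_db-fs:start

That would not cure a device that is present but not yet readable, though, which is what the sdc read errors above suggest.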
[Linux-HA] Problem with dependent groups at cluster startup
Hi all, I've got two groups of resources, say A and B. A depends on B, so if B isn't up, A must NOT be started. This is the situation:

group B Bres1 Bres2 Bres3
colocation B_ON_B_ms-r1 inf: B B_ms-r1:Master
order B_AFTER_B_ms-r1 inf: B_ms-r1:promote B:start

group A Ares1 Ares2 Ares3
colocation A_ON_A_ms-r0 inf: A A_ms-r0:Master
order A_AFTER_A_ms-r0 inf: A_ms-r0:promote A:start

order A_AFTER_B inf: B:start A:start

In a running configuration everything works fine: all the services switch in case of failures or if I force a manual move. The problem is at startup, because for some reason Pacemaker (1.0.10) starts Ares1 first of all, which of course fails because B is not started. I think that the last order constraint should force A not to start before B is up, but things are not going that way. What am I missing? Thanks a lot!

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
Re: [Linux-HA] Problem with dependent groups at cluster startup
Il giorno Mar 11 Gen 2011 13:36:58 CET, RaSca ha scritto: Hi all, I've got two groups of resources, say A and B. A depends on B, so if B isn't up, A must NOT be started. This is the situation:

group B Bres1 Bres2 Bres3
colocation B_ON_B_ms-r1 inf: B B_ms-r1:Master
order B_AFTER_B_ms-r1 inf: B_ms-r1:promote B:start

group A Ares1 Ares2 Ares3
colocation A_ON_A_ms-r0 inf: A A_ms-r0:Master
order A_AFTER_A_ms-r0 inf: A_ms-r0:promote A:start

order A_AFTER_B inf: B:start A:start

[...] What am I missing? Thanks a lot!

To be more specific, this is the log of what happens when I start up just one node: http://pastebin.com/YsP4B94r

And this is my configuration: http://pastebin.com/77gwm1Gr

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
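P.S.: one way to see why the PE schedules Ares1 first is to replay the decision from the live CIB with ptest; if memory serves, something like this:

ptest -L -s
ptest -L -D /tmp/transition.dot

The first prints the allocation scores, the second dumps the transition graph (viewable with dotty), which should show whether the A_AFTER_B constraint is being honoured or silently ignored.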
Re: [Linux-HA] ha for 2 jboss instancies in an active/passive cluster
Il giorno Mar 04 Gen 2011 12:13:48 CET, Erik Dobák ha scritto: [...] I have found this http://www.linux-ha.org/doc/re-ra-jboss.html but am simply not able to understand where this:

primitive example_jboss ocf:heartbeat:jboss \
    params \
        jboss_home="*string*" \
    op monitor depth="0" timeout="30s" interval="10s"

belongs. I would be thankful for a better jboss example or a few hints. E

First of all: are you using Pacemaker or not? If you're using Heartbeat without CRM, then you might consider upgrading to Pacemaker, or else create a separate init script for each jboss instance and configure them in /etc/ha.d/haresources.

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
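P.S.: for the haresources route, each instance gets its own init script and a line per node, something like this sketch (node names, addresses and script names are all invented):

node1 IPaddr::10.0.0.10/24/eth0 jboss-instance1
node2 IPaddr::10.0.0.11/24/eth0 jboss-instance2

That said, the Pacemaker way, with one ocf:heartbeat:jboss primitive per instance, is far easier to maintain.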
Re: [Linux-HA] Dependent groups of resources that can reside on different nodes
Il giorno Lun 20 Dic 2010 13:20:42 CET, RaSca ha scritto: [...] What am I missing?

As discussed with Andrew on IRC, the problem is fixed by removing the parentheses from the order and colocation declarations.

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
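P.S.: in other words, the sets collapse into plain ordered lists. Taking the order constraint from the original post, the working form would simply be:

order A_after_B inf: B_iscsitarget-export:start B_iscsitarget-lun:start B_ip:start A_iscsi-db:start A_db-fs:start A_fs:start A_ip:start

The colocations lose their parentheses the same way.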
[Linux-HA] Dependent groups of resources that can reside on different nodes
Hi all, I've got two groups of resources, say A and B. Suppose this is the situation:

group A:
colocation B_ON_B_ms-r1 inf: ( B_ip ) ( B_iscsitarget-lun ) ( B_iscsitarget-export ) B_ms-r1:Master
order B_AFTER_B_ms-r1 inf: B_ms-r1:promote ( B_iscsitarget-export:start ) ( B_iscsitarget-lun:start ) ( B_ip:start )

group B:
colocation A_ON_A_ms-r0 inf: ( A_ip ) ( A_fs ) ( A_db-fs ) ( A_iscsi-db ) A_ms-r0:Master
order A_AFTER_A_ms-r0 inf: A_ms-r0:promote ( A_iscsi-db:start ) ( A_db-fs:start ) ( A_fs:start ) ( A_ip:start )

The point is that the two groups can reside on different nodes, but A depends on B, so if B switches, A must be stopped, then B must be started, and after that A restarted. I've declared this order constraint, but it doesn't do what I'm expecting:

order A_after_B inf: ( B_iscsitarget-export:start ) ( B_iscsitarget-lun:start ) ( B_ip:start ) ( A_iscsi-db:start ) ( A_db-fs:start ) ( A_fs:start ) ( A_ip:start )

What am I missing? Thanks a lot!

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
Re: [Linux-HA] custom jboss init script on pacemaker
Il giorno Mar 30 Nov 2010 11:38:39 CET, Michael Kromer ha scritto: Hi, I've never seen a truly LSB-conformant init script for jboss, but the closest one I know of is http://www.riccardoriva.com/shared-files/jboss_init_script. You might need to strip out the Oracle stuff defined in there, but it should be a good starting point, as it handles status) the way LSB asks for it. - mike

I can confirm that the RA shipped with Pacemaker 1.0.9 and 1.0.10 works great for me too.

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
Re: [Linux-HA] custom jboss init script on pacemaker
Il giorno Mar 30 Nov 2010 11:55:50 CET, Michael Kromer ha scritto: Right, for reference: http://www.linux-ha.org/doc/re-ra-jboss.html I just recommend taking a careful look at the timeouts, as 60s could be too short for some larger applications. - mike

I confirm. I had to use 240s in some cases; jboss is sometimes very slow.

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
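P.S.: for reference, a primitive with the longer timeouts would look like this (the paths are examples only):

primitive example_jboss ocf:heartbeat:jboss \
    params jboss_home="/opt/jboss" java_home="/usr/lib/jvm/default-java" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="240s" \
    op monitor interval="10s" timeout="30s"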
[Linux-HA] Anything resource agent and workdir
Hi guys, while working with some Java batch programs I needed to configure the anything resource agent. I found that there's no way to define the working directory from which the executable must be launched, so I solved my problem by patching the resource agent to support a workdir parameter. I don't know if this is the best solution (any suggestion will be appreciated), but it worked for me, so I'm sharing it (you can find it attached). Bye,

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
mobile: +393281776712
ra...@miamammausalinux.org
http://www.miamammausalinux.org

--- ../heartbeat/anything	2010-07-15 11:26:18.000000000 +0200
+++ anything	2010-10-29 11:04:44.000000000 +0200
@@ -27,6 +27,7 @@
 # OCF instance parameters
 #	OCF_RESKEY_binfile
 #	OCF_RESKEY_cmdline_options
+#	OCF_RESKEY_workdir
 #	OCF_RESKEY_pidfile
 #	OCF_RESKEY_logfile
 #	OCF_RESKEY_errlogfile
@@ -34,7 +35,7 @@
 #	OCF_RESKEY_monitor_hook
 #	OCF_RESKEY_stop_timeout
 #
-# This RA starts $binfile with $cmdline_options as $user and writes a $pidfile from that.
+# This RA starts $binfile with $cmdline_options as $user in $workdir and writes a $pidfile from that.
 # If you want it to, it logs:
 # - stdout to $logfile, stderr to $errlogfile or
 # - stdout and stderr to $logfile
@@ -74,14 +75,14 @@
 	if [ -n "$logfile" -a -n "$errlogfile" ]
 	then
 		# We have logfile and errlogfile, so redirect STDOUT und STDERR to different files
-		cmd="su - $user -c \"nohup $binfile $cmdline_options >> $logfile 2>> $errlogfile & \"'echo \$!'"
+		cmd="su - $user -c \"cd $workdir; nohup $binfile $cmdline_options >> $logfile 2>> $errlogfile & \"'echo \$!'"
 	else if [ -n "$logfile" ]
 		then
 			# We only have logfile so redirect STDOUT and STDERR to the same file
-			cmd="su - $user -c \"nohup $binfile $cmdline_options >> $logfile 2>&1 & \"'echo \$!'"
+			cmd="su - $user -c \"cd $workdir; nohup $binfile $cmdline_options >> $logfile 2>&1 & \"'echo \$!'"
 		else
 			# We have neither logfile nor errlogfile, so we're not going to redirect anything
-			cmd="su - $user -c \"nohup $binfile $cmdline_options & \"'echo \$!'"
+			cmd="su - $user -c \"cd $workdir; nohup $binfile $cmdline_options & \"'echo \$!'"
 		fi
 	fi
 	ocf_log debug "Starting $process: $cmd"
@@ -169,6 +170,7 @@
 process="$OCF_RESOURCE_INSTANCE"
 binfile="$OCF_RESKEY_binfile"
 cmdline_options="$OCF_RESKEY_cmdline_options"
+workdir="$OCF_RESKEY_workdir"
 pidfile="$OCF_RESKEY_pidfile"
 [ -z "$pidfile" ] && pidfile=${HA_VARRUN}/anything_${process}.pid
 logfile="$OCF_RESKEY_logfile"
@@ -225,6 +227,13 @@
 <shortdesc lang="en">Command line options</shortdesc>
 <content type="string" />
 </parameter>
+<parameter name="workdir" required="1" unique="1">
+<longdesc lang="en">
+The path from where the binfile will be executed.
+</longdesc>
+<shortdesc lang="en">Full path name of the executable directory</shortdesc>
+<content type="string" default=""/>
+</parameter>
 <parameter name="pidfile">
 <longdesc lang="en">
 File to read/write the PID from/to.
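To show how the new parameter is meant to be used, here is a sketch of a primitive for one of those batch programs; every name and path below is hypothetical:

primitive java_batch ocf:heartbeat:anything \
    params binfile="/usr/bin/java" \
        cmdline_options="-jar /opt/batch/batch.jar" \
        workdir="/opt/batch" user="batchuser" \
        logfile="/var/log/java_batch.log" \
    op monitor interval="30s"

With the patch applied, the RA will su to batchuser, cd to /opt/batch and nohup the java command from there, as per the modified start code above.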
Re: [Linux-HA] HELP - debugging a hanging domU boot?
Il giorno Gio 28 Ott 2010 14:45:32 CET, Miles Fidelman ha scritto: [...] Any suggestions?

I ran into a similar problem that turned out to be caused by using the wrong console: http://groups.google.com/group/ganeti/browse_thread/thread/639f297c738e5adb

I was using ganeti, but your case can't be much different. Bye,

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
mobile: +393281776712
ra...@miamammausalinux.org
http://www.miamammausalinux.org
Re: [Linux-HA] Handling colocation constraints with more than 2 entries
Il giorno Mer 06 Ott 2010 12:34:22 CET, Dejan Muhamedagic ha scritto: [...] There's also the role change, which forces a break in the set. An example like this was posted to the list:

(a) collocation c1 inf: ms-r0:Master fs jboss

or, expressed as a chain of two-resource collocations:

collocation c1_1 inf: jboss fs
collocation c1_2 inf: fs ms-r0:Master

Now, the resource set would have to be split in two because of the role change, but that would also force us to move stuff around to preserve semantics:

(b) collocation c1 inf: [ fs jboss ] [ ms-r0:Master ]

because adjacent resource sets have the same semantics as two-resource collocations, right? The case where the role change is in the middle:

(a) collocation c1 inf: A B:Master C

becomes

(b) collocation c1 inf: [ C ] [ B:Master ] [ A ]

All this strikes me as suboptimal, but I'm not sure what is to be done about it. Basically, users should be able to type the (a) versions and let the software handle the rest. Opinions? Cheers, Dejan

I totally agree with you, Dej. Note that to make things work and keep them compatible, an automated management of the sets like this one is IMHO the best way. Glad to see that my problems created such a discussion!

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
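P.S.: for reference, the (b) form maps onto nested resource_set elements in the CIB; roughly, for the first example (ids hypothetical, pacemaker 1.x schema):

<rsc_colocation id="c1" score="INFINITY">
  <resource_set id="c1-0">
    <resource_ref id="fs"/>
    <resource_ref id="jboss"/>
  </resource_set>
  <resource_set id="c1-1" role="Master">
    <resource_ref id="ms-r0"/>
  </resource_set>
</rsc_colocation>

The role attribute lives on the set, not on the individual reference, which is exactly why a member with a different role forces a set split.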
Re: [Linux-HA] Trying to understand sets
Il giorno Mar 05 Ott 2010 15:55:56 CET, Dejan Muhamedagic ha scritto: [...] The problem seems to be with collocations in particular. Please see the other thread, in which Andreas Kurz explained the differences well. Thanks,

Thanks to you, Dejan! I'm following that thread with great attention. Bye,

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
[Linux-HA] Trying to understand sets
Hi all, as discussed two days ago on IRC, since the 1.0.9 version has some problems with multistate resources and groups, I'm trying to make sets work. I started from this configuration:

group cluster cluster-fs cluster-jboss
ms cluster-ms-r0 cluster-r0 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation cluster_on_cluster-r0 inf: cluster cluster-ms-r0:Master
order cluster_after_cluster-r0 inf: cluster-ms-r0:promote cluster-fs:start

But, as I said, there's a bug in this version of Pacemaker, so the solution is to handle every single resource. Thanks to the IRC guys match and andreask I found a solution using sets, in this way:

ms cluster-ms-r0 cluster-r0 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation cluster-fs_on_cluster-r0 inf: ( cluster-jboss ) ( cluster-fs ) cluster-ms-r0:Master
order cluster-fs_after_cluster-r0 inf: cluster-ms-r0:promote ( cluster-fs:start ) ( cluster-jboss:start )

Everything works fine, but I have two questions:

1) What's the difference between the declaration with sets above and this one:

ms cluster-ms-r0 cluster-r0 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation cluster-fs_on_cluster-r0 inf: cluster-jboss cluster-fs cluster-ms-r0:Master
order cluster-fs_after_cluster-r0 inf: cluster-ms-r0:promote cluster-fs:start cluster-jboss:start

Would the resources come up all at the same time this way? Note: this did not work for me, of course.

2) Using sets, is it also possible to declare groups just for logical purposes (and for the output of crm_mon), without using them in colocation and order declarations? Does the creation of a group change the relations between the resources?

Thanks a lot!

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
Re: [Linux-HA] Heartbeat/Pacemaker Italian article series complete!
Il giorno Mer 22 Set 2010 13:44:07 CET, Dejan Muhamedagic ha scritto: [...] Didn't understand a thing, but looks great :)

LOL!

Just one note that caught my attention: a node preference is usually expressed in non-absolute terms and with a shorter syntax:

location cli-prefer-share-a share-a 100: ubuntu-nodo1

Cheers, Dejan

I just used that declaration (location cli-prefer-share-a share-a rule inf: #uname eq ubuntu-nodo1) to reflect the rule that Pacemaker adds when you migrate a resource; this is because I mention it in the second test. Anyway, I will keep this shorter declaration in mind; it will surely be useful. Thanks a lot, Dejan!

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
[Linux-HA] Heartbeat/Pacemaker Italian article series complete!
Hi all guys, yesterday I finally finished and published the last article of the Heartbeat/Pacemaker series. These are the links to the articles:

http://www.miamammausalinux.org/2010/04/evoluzione-dellalta-affidabilita-su-linux-come-orientarsi-fra-hertbeat-pacemaker-openais-e-corosync/
http://www.miamammausalinux.org/2010/06/evoluzione-dellalta-affidabilita-su-linux-confronto-pratico-tra-heartbeat-classico-ed-heartbeat-con-pacemaker/
http://www.miamammausalinux.org/2010/09/evoluzione-dellalta-affidabilita-su-linux-realizzare-un-nas-con-pacemaker-drbd-ed-exportfs/

All three articles are written in Italian, but I hope you will enjoy them anyway. Keep up the good work! And thanks again for the help you give me every time.

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
Re: [Linux-HA] HA configuration issues
Il giorno Mar 03 Ago 2010 19:36:56 CET, Tim Macking ha scritto: I have a system that is in production and has issues. While I appreciate the link of books to read, I really came here looking for some help or advice. If anyone could offer some, I would be most grateful and appreciative. Telling me to go read is like telling someone who wants to know what the Golden Rule is to go read the Bible. Please, can anyone offer some advice or help here? [...]

I suggested that reading because what you asked and what is described in Clusters From Scratch are very similar. Andrew's document contains much of the information you need (including the fact that RedHat and Fedora prefer corosync over heartbeat).

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
Re: [Linux-HA] HA configuration issues
Il giorno Lun 02 Ago 2010 15:25:36 CET, Tim Macking ha scritto: I am fairly new to Linux, specifically RedHat Enterprise. The project I have now is unraveling how two servers were set up with HA, why it is working (but not entirely), and how to get it configured correctly. I have read over the documentation at http://www.linux-ha.org/doc/ [...]

I strongly recommend you have a look at Clusters From Scratch, here: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/

It will help you understand how the cluster can be configured from the ground up.

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
Re: [Linux-HA] NFSServer options question
Il giorno Ven 02 Lug 2010 18:50:29 CET, Daniel Machado Grilo ha scritto: Dear HA users, [...] I understand I have to add an nfsserver primitive for each group of services, as a group can migrate to the other node and then heartbeat will not be able to unmount the FS because NFS is using it. [...]

Hi Daniel, note that you can't have two nfsserver primitives declared in different groups in the same cluster: the nfsserver primitive refers to the system-wide /etc/exports, which must be the same on the two nodes. If you want an active/active NFS setup you should use the exportfs resource instead. Have a good day,

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
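P.S.: to give an idea, one export in such an active/active setup would look like this; the directory, clientspec and fsid values are made up:

primitive share_a ocf:heartbeat:exportfs \
    params directory="/srv/share_a" \
        clientspec="10.0.0.0/255.255.255.0" \
        options="rw,no_root_squash" fsid="1" \
    op monitor interval="30s"

Each node can then run its own exportfs primitives (each with a distinct fsid), and /etc/exports stays out of the picture entirely.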