Re: [Linux-HA] fence_apc always fails after some time and resources remains stopped

2013-11-28 Thread RaSca
Il giorno Ven 22 Nov 2013 10:26:08 CET, RaSca ha scritto:
[...]
 After this the resource remains in the stopped state. Why does this happen? Am I in
 this case: https://github.com/ClusterLabs/pacemaker/pull/334 ?
 What kind of workaround can I use?
 Thanks a lot, as usual.

I don't know if my problem is the one described in the issue referenced above, but
what I found to resolve my problem was to use fence_apc_snmp instead
of fence_apc. Since it uses SNMP it is faster and it never times out.
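
For reference, a fence_apc_snmp resource along the lines of the original one
would look roughly like this (same placeholders as before; the SNMP community
and the pcmk_monitor_timeout value are only illustrative, check the agent
metadata for the exact parameters your firmware needs):

primitive st_fence_scv1 stonith:fence_apc_snmp \
        params ipaddr=APCADDR community=COMMUNITY port=1 \
                pcmk_host_check=static-list pcmk_host_list=scv1 \
                pcmk_monitor_timeout=60s \
        op monitor interval=60s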

All the other questions are still open, so if you want to give a
suggestion I am open to discussion.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] fence_apc always fails after some time and resources remains stopped

2013-11-22 Thread RaSca
Hi there,
I'm using Pacemaker 1.1.10 on a Debian cluster of two machines, which
are connected to an APC power switch that I can contact from the command
line in this way:

# fence_apc -a APCADDR -x -l USER -p PASS -n 1 -o status
Status: ON

And for which I've configured two fence resources in this way:

primitive st_fence_scv1 stonith:fence_apc \
        params ipaddr=APCADDR login=USER passwd=PASS \
                action=reboot verbose=true pcmk_host_check=static-list \
                pcmk_host_list=scv1 secure=true port=1 \
        op monitor interval=60s

When doing a clean start everything works fine. The problem is that,
ALWAYS, after about an hour, the monitor operation of the resource
fails. In the logs I see this:

Nov 21 21:46:12 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:12 [2661] scv1 stonith-ng: info: stonith_action_create:
   Initiating action monitor for agent fence_apc (target=(null))

So the monitor is launched, and then:

Nov 21 21:46:32 [2661] scv1 stonith-ng: info: st_child_term:
Child 20854 timed out, sending SIGTERM
Nov 21 21:46:32 [2661] scv1 stonith-ng:   notice:
stonith_action_async_done:Child process 20854 performing action
'monitor' timed out with signal 15
Nov 21 21:46:32 [2661] scv1 stonith-ng:   notice: log_operation:
Operation 'monitor' [20854] for device 'st_fence_scv2' returned: -62
(Timer expired)
Nov 21 21:46:32 [2665] scv1   crmd:error: process_lrm_event:
LRM operation st_fence_scv2_monitor_6 (464) Timed Out (timeout=2ms)

So, there is a timeout (and this may well happen, since those APC devices
are very slow).
After this the device is stopped:

Nov 21 21:46:42 [2662] scv1   lrmd: info: log_execute:
executing - rsc:st_fence_scv2 action:stop call_id:469
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_device_remove from lrmd.2662: OK (0)

And then restarted:

Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_action_create:
   Initiating action metadata for agent fence_apc (target=(null))
Nov 21 21:46:42 [2661] scv1 stonith-ng:   notice:
stonith_device_register:  Device 'st_fence_scv2' already existed in
device list (1 active devices)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_device_register from lrmd.2662: OK (0)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_execute from lrmd.2662: Operation now in progress (-115)
Nov 21 21:46:42 [2661] scv1 stonith-ng: info: stonith_action_create:
   Initiating action monitor for agent fence_apc (target=(null))

The first thing that I find strange is the "already existed in device
list" message, but anyway after this the monitor fails again:

Nov 21 21:47:02 [2661] scv1 stonith-ng: info: st_child_term:
Child 21265 timed out, sending SIGTERM
Nov 21 21:47:02 [2661] scv1 stonith-ng:   notice:
stonith_action_async_done:Child process 21265 performing action
'monitor' timed out with signal 15
Nov 21 21:47:02 [2661] scv1 stonith-ng:   notice: log_operation:
Operation 'monitor' [21265] for device 'st_fence_scv2' returned: -62
(Timer expired)
...
...
Nov 21 21:47:03 [2661] scv1 stonith-ng: info: stonith_command:
Processed st_device_remove from lrmd.2662: OK (0)

After this the resource remains in the stopped state. Why does this happen? Am I in
this case: https://github.com/ClusterLabs/pacemaker/pull/334 ?

What kind of workaround can I use?
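
One thing that may be worth trying as a stop-gap (an untested sketch, the
120s value is arbitrary) is declaring an explicitly longer monitor timeout on
the fence resources, so that a slow APC answer does not immediately count as
a failure:

primitive st_fence_scv1 stonith:fence_apc \
        params ipaddr=APCADDR login=USER passwd=PASS \
                action=reboot verbose=true pcmk_host_check=static-list \
                pcmk_host_list=scv1 secure=true port=1 \
        op monitor interval=60s timeout=120s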

Thanks a lot, as usual.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Many location on ping resources and best practice for connectivity monitoring

2013-08-09 Thread RaSca
Il giorno Ven 09 Ago 2013 04:42:28 CEST, Andrew Beekhof ha scritto:
[...]
 That sounds like something playing with the virt bridge when the vm starts.
 Is the host trying to ping through the bridge too?

Yes. Is this not correct?

 Many location constraints can reference the attribute created by a single 
 ping resource.
 Its still not clear to me if you have one ping resource or one ping resource 
 per vm... don't do the second one.

I've got many locations based on the same cloned resource (which is
named ping).

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Many location on ping resources and best practice for connectivity monitoring

2013-08-08 Thread RaSca
Il giorno Gio 08 Ago 2013 01:07:06 CEST, Andrew Beekhof ha scritto:
 On 08/08/2013, at 12:37 AM, RaSca ra...@miamammausalinux.org wrote:
[...]
 The problem I got is that when I clone a VM (using virt-clone)
 everything works fine until I try to add a new ping check.
 Can you more precisely describe what you mean by this?

Of course: the steps for adding a new virtual machine are:

- put the original resource in unmanaged;
- clone the original resource via virt-clone;
- add a primitive for the new vm;
- add an order/colocation constraint over the storage;
- add a location based upon the ping like this one:

location loc_res_VirtualDomain_vm_connectivity res_VirtualDomain_vm \
rule -inf: not_defined ping or ping lte 0

At this point something breaks. The ping resource of the node where
the vm will be placed fails, making all the resources on it migrate.

 1) Are there limitations about how many ping location can be declared?
 Well, there is a finite number of hosts that can be ping'd within a given 
 interval.
 Is your timeout too short perhaps? Are you using fping which works in 
 parallel?

I'm not using fping (maybe this could be a solution) and the timeout of
the ping resource is 20 (which makes sense to me).
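
For what it's worth, a cloned ping resource using fping and a larger timeout
would look roughly like this (a sketch with illustrative host addresses and
values, not my current configuration):

primitive ping ocf:pacemaker:ping \
        params name=ping host_list="10.0.0.1 10.0.0.2 10.0.0.3" \
                multiplier=1000 dampen=30s use_fping=1 \
        op monitor interval=60s timeout=60s
clone ping_clone ping \
        meta interleave=true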

 2) Is this one (one vm = one ping location) the best practice to monitor
 the connections of the nodes?
 ping resources were intended to check if a cluster node could reach the 
 outside world.
 You're using them to check if a VM resource is alive?  Perhaps David's 
 remote-node stuff would be better suited.

I'm using them to check if a resource (the vm) is on a node which can
reach the outside world. So one vm = one location. Is there a way to set
a location on an entire node, so that if it loses the outside world all
the resources on it are migrated? I was convinced that this kind of
location should be set up on the single resource.

Thanks,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Many location on ping resources and best practice for connectivity monitoring

2013-08-08 Thread RaSca
Il giorno Gio 08 Ago 2013 08:29:09 CEST, Ulrich Windl ha scritto:
 Hi!
 I don't know whether this helps, but in a different configuration we saw
 monitor timeouts for ipaddr2 when there was a high I/O load.
 Meanwhile we have upgraded all the software, but we had disabled most monitos
 for ipaddr2.
 Regards,
 Ulrich

Hi Ulrich,
thanks for your answer; I can confirm that ping fails when we have
high load.
How have you managed the monitors for each node? Did you just
disable them, or did you use some other workaround?

Thanks,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-HA] Many location on ping resources and best practice for connectivity monitoring

2013-08-07 Thread RaSca
Hi all,
I have a big Pacemaker (1.1.9-1512) cluster with 9 nodes and almost 200
virtual machines (with the same storage underneath). Everything is
based upon KVM and libvirt.
Each VM has got a location constraint, based upon a cloned ping resource
on each node that pings three hosts on the net.
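
For context, the setup is conceptually like this (names and addresses here
are only illustrative, not the real configuration):

primitive ping ocf:pacemaker:ping \
        params name=ping host_list="192.168.0.1 192.168.0.2 192.168.0.3" \
                multiplier=1000 \
        op monitor interval=20s timeout=20s
clone ping_clone ping meta interleave=true
location loc_vm_connectivity vm_resource \
        rule -inf: not_defined ping or ping lte 0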

The problem I've got is that when I clone a VM (using virt-clone)
everything works fine until I try to add a new ping check.
At that point, for some reason, the ping resource of the node
fails, with errors like this:

Jul 30 15:34:58 kvm09 lrmd[23467]:  warning: child_timeout_callback:
res_ping_connections_monitor_5000 process (PID 26406) timed out

We're investigating potential network problems (obviously the
network guys say those are impossible, but when the problem
happens there are sometimes high ping latencies on the node), but what I
find very strange is that things break ONLY when I add a location
based upon ping, not, for example, when I add the storage order and
colocation constraints for the VM.

So my two questions:

1) Are there limitations on how many ping-based locations can be declared?
2) Is this (one vm = one ping location) the best practice for monitoring
the connectivity of the nodes?

Thanks for your help,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Retransmit list and window_size

2013-05-03 Thread RaSca
Il giorno Ven 05 Apr 2013 15:29:36 CEST, RaSca ha scritto:
[...]
 It seems that when a configuration message has to run over the ring, in
 some particular cases, everything collapses. Following Florian's article
 I've tried setting a window_size of 300, but since everything stayed the
 same, I think that with a default netmtu of 1500, and following the man
 page of corosync, I must not go over 170 (256000 / netmtu).
 The point is: what else can I check? Does it make sense to set a
 window_size LOWER than 50?
 Thanks for your help,

I'll answer myself; maybe it will be useful for someone else.
There was no way to make multicast work in this network. It does not
depend on the window_size or other parameters: sometimes it
just breaks.
Even though multicast was tested successfully (with omping and also mnc),
sometimes the ring does not complete and I get the retransmit list
messages that make the cluster crash.
The only solution I've found was to use unicast, declaring transport:
udpu in corosync.conf and a member section for each node in the cluster.
Doing this brought everything up again. I still see the retransmit list
messages, but they are on the order of one per hour, so it's fine.
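
For reference, the relevant corosync.conf fragment for unicast looks roughly
like this on corosync 1.4.x (addresses are placeholders; check corosync.conf(5)
for your version):

totem {
        version: 2
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.0.0
                mcastport: 5405
                member {
                        memberaddr: 10.0.0.1
                }
                member {
                        memberaddr: 10.0.0.2
                }
        }
}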

Have you got other suggestions?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Resource move not moving

2013-04-16 Thread RaSca
Il giorno Mar 16 Apr 2013 15:50:07 CEST, Marcus Bointon ha scritto:
 I'm running crm using heartbeat 3.0.5 pacemaker 1.1.6 on Ubuntu Lucid 64.
[...]
 So if all that's true, why is that resource group still on the original node? 
 Is there something else I need to do?
 Marcus

Try using crm_resource with -f, to force the operation.
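
Something along these lines (the resource name is just an example, and option
names can differ slightly between Pacemaker versions, so check
crm_resource --help):

# move the group away from its current node, forcing the operation
crm_resource -M -r grp_myservices -f
# later, remove the implicit move constraint again
crm_resource -U -r grp_myservices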

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Retransmit list and window_size

2013-04-05 Thread RaSca
Hi there,
In one of my clusters I still have problems with retransmit list
messages. The problem is not reproducible: sometimes, while the cluster
is changing its state (for example when migrating a vm from one node to
another), it starts emitting retransmit list messages and in the worst
case it loses quorum.

I followed what Florian wrote here:
http://www.hastexo.com/resources/hints-and-kinks/whats-totem-retransmit-list-all-about-corosync
but I still got some doubts.

I'm sure that this 9-node cluster is composed of identical machines and
I'm quite sure that multicast on the network has no problems, even if the
nodes are distributed across different enclosures. I say quite because
I've done some tests with tools like MNC and the connection seems to be
fine and not losing anything.

It seems that when a configuration message has to run over the ring, in
some particular cases, everything collapses. Following Florian's article
I've tried setting a window_size of 300, but since everything stayed the
same, I think that with a default netmtu of 1500, and following the man
page of corosync, I must not go over 170 (256000 / netmtu).

The point is: what else can I check? Does it make sense to set a
window_size LOWER than 50?

Thanks for your help,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Problem with exportfs

2013-02-13 Thread RaSca
Il giorno Mer 13 Feb 2013 17:36:13 CET, Dejan Muhamedagic ha scritto:
 Hi,
 On Wed, Feb 13, 2013 at 02:03:16PM +0100, Ulrich Windl wrote:
 Hi!
 I've made a patch to let exportfs propagate the errors it reported to the 
 exit code of the process (see attachments, the compressed tar is there in 
 case the mailer corrupts the patche files):
 You won't get the right audience here for exportfs (the
 program). I'm not sure where the NFS stuff is discussed, but
 there's probably a public forum somewhere.
 Thanks,
 Dejan
[...]

There's an NFS mailing list here: linux-...@vger.kernel.org; it's the place
where I asked (three years ago) about exportfs
(http://en.usenet.digipedia.org/thread/18978/8062/).

Bye,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problems with quorum, no-quorum-policy and NMI messages

2012-10-17 Thread RaSca
Il giorno Mar 16 Ott 2012 23:44:15 CEST, Lars Marowsky-Bree ha scritto:
[...]
 Depending on what kind of problem this node has, it could be that it
 erratically affects timing of network messages, or even sends garbage,
 which has the potential to mess up the totem protocol pretty much.
 What corosync version do you have?
 And yes, this is impossible to diagnose without the full cluster logs
 etc. A good candidate for bugzilla.
 Regards,
 Lars

Hi Lars,
thank you for your answer. I know that without the full logs a
coherent analysis is impossible, but as you can imagine there are a lot
of logs about this problem and yes, I will file a bugzilla as soon as
possible.

Some more information about the systems:

OS version: CentOS release 6.2 (Final)
Kernel version: 2.6.32-220.23.1.el6.x86_64
Corosync version: corosync-1.4.1-4.el6_2.3.x86_64

Digging into the failed node I also saw this message:

ERST: Can not request iomem region
0x88103419be60-0x102068337cc0 for ERST.

From the Red Hat Knowledge Base it seems that the root cause is a
kernel problem with ERST (Error Record Serialization Table) access.

The suggested resolution is to upgrade the kernel to version 2.6.32-279.el6. I
just need to know whether this error is a consequence of the original one
(NMI) or its cause. What I know is that it appeared after the NMI
error so, maybe, it is a consequence.

As I said, I will file a bugzilla soon. Thanks again,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Time based resource stickiness example with crm configure ?

2012-08-30 Thread RaSca
Il giorno Gio 30 Ago 2012 14:53:45 CEST, Stefan Schloesser ha scritto:
 Hi,
 I would like to configure the resource-stickiness to 0 tuesdays between 2 
 and 2:20 am local time.
 I could not find any examples on how to do this using crm configure ... but 
 only the XML snippets to accomplish this.
 Could someone point me to the documentation or give me an example on the 
 syntax?
 Thanks,
 Stefan

Did you take a look at Pacemaker Explained? In any case, you can
configure a cron job that runs at the time you want and launches crm
configure property default-resource-stickiness=0.
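
For example, something like these cron entries in /etc/crontab could do it (a
sketch only; the 100 used to restore stickiness is just a placeholder for
whatever your normal value is):

# Tuesdays, 02:00: drop stickiness; 02:20: restore it
0  2 * * 2 root crm configure property default-resource-stickiness=0
20 2 * * 2 root crm configure property default-resource-stickiness=100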

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Best way to know on which host a resource has failed and where it will be promoted

2012-08-26 Thread RaSca
Hi all,
I want to interact with the new master election. I don't know if I must
operate at the Resource Agent level or at the cluster level, so I'm open to
suggestions.

Suppose I've got a multi-state resource for which I have one master and
two (or more) slaves. Suppose then that the master resource fails. At
this point, and BEFORE the new master is elected, I need to run a software
check (based upon the name of the failed host) that returns the best new
master. I want to pass this host to the cluster so that it can promote it.

Is there a way to configure this kind of script at cluster level (like
we do with locations), or must I interact with the resource agent
somewhere in the notify part (like assigning different weights
dynamically to the nodes)?
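
To make concrete what I mean by assigning different weights dynamically: the
usual hook at RA level is crm_master, which sets the per-node promotion score.
Something like this inside the agent (the pick_best_master script and the
score values are purely hypothetical):

# inside the RA, e.g. in the notify or monitor action
if /usr/local/bin/pick_best_master; then
        crm_master -l reboot -v 100   # prefer this node for promotion
else
        crm_master -l reboot -v 10    # promotable, but lower preference
fi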

Thank you all for any suggestion.

-- 
Raoul Scarazzini
Solution Architect
MMUL: Niente è impossibile da realizzare, se lo pensi bene!
+39 3281776712
ra...@mmul.it
http://www.mmul.it

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource

2012-08-22 Thread RaSca
Il giorno Mer 22 Ago 2012 09:00:55 CEST, Ulrich Windl ha scritto:
[...]
 Hi!
 Amazingly the primary key (=ID) of the monitor operations is built using 
 the interval, not the role. So if you have two monitor operations with 
 the same interval, you have a resource conflict. It's documented, although 
 it's a sick concept...
 Decide whose fault it is... yours or the CRMs...
 Regards,
 Ulrich

Thank you Ulrich,
As far as you know, is there a way to override the ID for each cloned
instance of the mysql resource? How can I resolve the problem?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Duplicate monitor operation on a multi state resource

2012-08-22 Thread RaSca
Il giorno Mer 22 Ago 2012 10:11:52 CEST, Lars Marowsky-Bree ha scritto:
 Just make the intervals slightly different - 31s, 30s, 29s ...
 Regards,
 Lars

Thank you Lars,
In fact, this is what I've done, and now everything is OK. But I want to
understand one last thing: if the ID is calculated from the value of the
interval, then why don't I get errors even though I've got two slaves,
which means that I've got two identical intervals?

I hope I have made myself clear.

Thanks a lot,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Duplicate monitor operation on a multi state resource

2012-08-21 Thread RaSca
Hi all,
I'm trying to use the mysql resource agent to manage a setup with one
master and two slaves. This is the configuration of the mysql resource
and the master/slave one:

primitive resMySQL ocf:custom:mysql \
        params binary=/usr/bin/mysqld_safe config=/etc/my.cnf \
                datadir=/var/lib/mysql user=mysql replication_user=myuser \
                replication_passwd=mypassword \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=120 \
        op promote interval=0 timeout=120 \
        op demote interval=0 timeout=120 \
        op monitor interval=10 role=Master timeout=30 \
        op monitor interval=10 role=Slave timeout=30
ms ms_resMySQL resMySQL \
        meta master-max=1 master-node-max=1 clone-node-max=1 \
                clone-max=3 notify=true globally-unique=false

The problem is that I see from the logs some errors like these:

Aug 21 15:24:53 domU-12-31-39-0C-1A-2B pengine: [3816]: ERROR:
is_op_dup: Operation resMySQL-monitor-10-0 is a duplicate of
resMySQL-monitor-10
Aug 21 15:24:53 domU-12-31-39-0C-1A-2B pengine: [3816]: ERROR:
is_op_dup: Do not use the same (name, interval) combination more than
once per resource
...
(the same two errors repeat several more times)

and in fact, even if I manually kill the process on a node, the cluster
isn't aware of it and does not react.

What is wrong with this ms resource?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?

2012-06-07 Thread RaSca
Il giorno Mer 06 Giu 2012 23:03:49 CEST, Lars Ellenberg ha scritto:
[...]
 Two globally-unique clones I came accross in real life:
 Cluster IP buckets, in the sense of the iptables CLUSTERIP target.
 Sequences of IPs generated by the IPaddr2 resource,
 where the clone id is added to the base IP.
 Both will also need to allow clone-node-max  1,
 and one node will host more than one clone instances
 in the failover case.

Thank you Lars, now everything is clearer. Andrew: how about putting
another example in the Pacemaker Explained docs about this kind of
resource? I mean extending the Clones chapter (here
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-clone.html)
with another example of a globally-unique resource like the one
described by Lars.

The example that is there now is about an anonymous clone, so I think it
would be useful to have that box under the anonymous description and the
globally-unique example below the globally-unique description. For the
stateful resources there is no problem since they have a separate chapter.

Lars, could you please provide a sample XML fragment of the solution you
have suggested (either one is fine)? I can modify the docs myself and then
submit the patch to Andrew.
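
As a starting point, I imagine the globally-unique example would look
something like this in crm syntax (values are only illustrative; this is the
CLUSTERIP/IPaddr2 case Lars describes):

primitive clusterip ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.100 cidr_netmask=24 \
                clusterip_hash=sourceip-sourceport \
        op monitor interval=30s
clone cl_clusterip clusterip \
        meta globally-unique=true clone-max=2 clone-node-max=2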

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Does globally-unique make sense on filesystems cloned resources?

2012-06-06 Thread RaSca
Hi all,
I've configured an NFS share which is cloned on each node of my cluster.
What I need to understand is how the globally-unique parameter applies
to this situation. Starting from its definition:

Globally unique clones are distinct entities. A copy of the clone
running on one machine is not equivalent to another instance on another
node. Nor would any two copies on the same node be equivalent.

How can this be applied to filesystem resources? In addition, I've set
this parameter to true, since my filesystems are identical on each
node, but does this make sense?

Thanks a lot,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?

2012-06-06 Thread RaSca
Il giorno Mer 06 Giu 2012 16:53:28 CEST, Florian Haas ha scritto:
[...]
 Nope. :) Quite the contrary. It's the same filesystem you're mounting
 everywhere. That's a relatively classic anonymous clone. The
 globally-unique=false default should apply here.
 Cheers,
 Florian

Thank you Florian, but how can one declare an anonymous clone? Is it
implicit with the globally-unique=false?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Does globally-unique make sense on filesystems cloned resources?

2012-06-06 Thread RaSca
Il giorno Mer 06 Giu 2012 17:35:03 CEST, Lars Marowsky-Bree ha scritto:
 On 2012-06-06T17:26:41, RaSca ra...@miamammausalinux.org wrote:
 Thank you Florian, but how can one declare an anonymous clone? Is it
 implicit with the globally-unique=false?
 You don't need to explicitly declare that. It is the default.
 (But yes, the default is globally-unique=false.)
 Regards,
 Lars

Just for completeness: could you please mention a resource that might be
globally-unique?

Thanks,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat question about multiple services

2012-05-09 Thread RaSca
Il giorno Ven 20 Apr 2012 12:42:16 CEST, sgm ha scritto:
 Hi,
 I have a question about heartbeat, if I have three services, apache, mysql 
 and sendmail,if apache is down, heartbeat will switch all the services to the 
 standby server, right?
 If so, how to configure heartbeat to avoid this happen?
 Very Appreciated.gm

You may want to start from here:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Can a HA cluster be built with nodes in different VLANs?

2012-02-07 Thread RaSca
Il giorno Mer 08 Feb 2012 02:51:40 CET, Ryan Stepalavich ha scritto:
 Good evening,
 I'm currently attempting to build a LAMP high-availability cluster in
 Ubuntu 11.10. The trick is that each node is in a different VLAN. This
 causes Heartbeat to die when trying to fail over the hosted IP into an
 invalid VLAN.
 Site 1 VLAN: 10.204.200.0/24
 Site 2 VLAN: 10.204.202.0/24
 Is there a way around this issue?
 Thanks!

The only way out is to put a load balancer in front of everything; it will
balance the load across the two different VLANs.

You can do it with hardware load balancers or with Linux LVS (e.g.
ldirectord).

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker : how to modify configuration ?

2011-11-29 Thread RaSca
Il giorno Lun 28 Nov 2011 15:04:45 CET, alain.mou...@bull.net ha scritto:
 Hi
 sorry but I forgot if there is another way than crm configure edit to 
 modify
 all the value of on-fail= for all resources in the configuration ? 
 Thanks 
 Alain


Why not use :%s/on-fail=.*/on-fail=newvalue/g in crm configure edit?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker : how to modify configuration ?

2011-11-29 Thread RaSca
Il giorno Mar 29 Nov 2011 09:30:41 CET, alain.mou...@bull.net ha scritto:
 Hi
 Yes I know it is possible this way, but I don't like to tell anybody to
 use crm configure edit because it is a command a little bit risky, risk
 of corruption of the file ... when I'm the person who operates, I often 
 use crm configure edit, but I'm a little reluctant to tell somebody else
 not really a pacemaker specialist to use this command. 
 So I'd prefer a command with cibadmin/grep/sed as Andrew suggest it.
 Thanks
 Alain

Consider that a bad configuration will not be accepted by the crm
editor. In addition, it is possible to dump the current configuration
before making any modifications.
That said... if you're reluctant to let a non-specialist user modify
the configuration, then why let them modify delicate parameters such as
on-fail?
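
For example, before editing one can keep a copy of the current configuration
with something like (illustrative paths):

crm configure save /root/cib-backup.crm
# or, at the XML level:
cibadmin -Q > /root/cib-backup.xml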

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] LCMC NumberFormatException

2011-10-25 Thread RaSca
Il giorno Gio 13 Ott 2011 09:28:33 CEST, RaSca ha scritto:
 Il giorno Mer 12 Ott 2011 22:22:15 CEST, Rasto Levrinc ha scritto:
 [...]
 Well, LCMC doesn't handle k in this type of fields, as Caspar said.
 Will be fixed. As a workaround you can set it to 131072.
 Rasto
 Clear, thanks.

With LCMC 1.0.2 the problem is solved. Thank you Rasto!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] LCMC NumberFormatException

2011-10-13 Thread RaSca
Il giorno Mer 12 Ott 2011 22:22:15 CEST, Rasto Levrinc ha scritto:
[...]
 Well, LCMC doesn't handle k in this type of fields, as Caspar said.
 Will be fixed. As a workaround you can set it to 131072.
 Rasto

Clear, thanks.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] LCMC NumberFormatException

2011-10-12 Thread RaSca
Hi all,
I'm facing this problem while connecting to one of my pacemaker clusters 
with LCMC (also known as DMC):

AppError.Text
release: 1.0.1
java: Sun Microsystems Inc. 1.6.0_26

uncaught exception
java.lang.NumberFormatException: For input string: 128k
For input string: 128k
java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
java.lang.Long.parseLong(Long.java:419)
java.lang.Long.parseLong(Long.java:468)
lcmc.data.DrbdXML.checkParam(DrbdXML.java:419)
lcmc.gui.resources.DrbdResourceInfo.checkParam(DrbdResourceInfo.java:238)
lcmc.gui.resources.EditableInfo.checkResourceFieldsCorrect(EditableInfo.java:957)
lcmc.gui.resources.DrbdResourceInfo.checkResourceFieldsCorrect(DrbdResourceInfo.java:792)
lcmc.gui.resources.DrbdResourceInfo.checkResourceFieldsCorrect(DrbdResourceInfo.java:728)
lcmc.gui.resources.EditableInfo$7.run(EditableInfo.java:601)
java.lang.Thread.run(Thread.java:662)

What could the problem be?

Please tell me also if this message is off topic for the list...

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] LCMC NumberFormatException

2011-10-12 Thread RaSca
Il giorno Mer 12 Ott 2011 15:09:24 CEST, Rasto Levrinc ha scritto:
 On Wed, Oct 12, 2011 at 2:49 PM, Caspar Smitc.s...@truebit.nl  wrote:
 Hi Rasca,
 Probably the sndbuf-size (or any other variable) in drbd.conf is set to 128K
 That shouldn't be a problem, normally. Rasca, what parameters are set
 to 128k and what
 DRBD version(s) do you have? Maybe 128K would work.
 Rasto

max-buffers 128k; in global.conf

The DRBD version is 8.3.10.

Thanks,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Understanding why a host fence (was: Resource fail and node fence)

2011-09-27 Thread RaSca
Il giorno Mar 27 Set 2011 06:28:00 CEST, Andrew Beekhof ha scritto:
[...]
 /stop/ failed, your on-fail setting only applies to the /monitor/ operation

Yes Andrew, now it's absolutely clear. Increasing the timeout for the 
migration to 240s and setting on-fail=block for the stop operation 
solved my problems. Even if I think the timeouts alone did the trick, 
setting on-fail as well gives me more control over errors during the 
migrations, since a single vm migration failure no longer fences an 
entire node.
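
In practice, the relevant operation lines on the VirtualDomain primitives now 
look roughly like this (a fragment sketching the change described above; the 
other operations are unchanged):

primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
        ...
        op migrate_to interval=0 timeout=240s \
        op stop interval=0 timeout=120s on-fail=block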

Thanks,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Resource fail and node fence

2011-09-21 Thread RaSca
Il giorno Mar 20 Set 2011 17:54:58 CEST, Dejan Muhamedagic ha scritto:
[...]
 And I completely agree with this, but in an environment like mine, where
 a single resource failure might drag down all the others (with fencing), it is
 wrong to keep this kind of setting. Do you agree with me?
 No. If the resource cannot stop, then something's wrong either
 with the resource or with the RA. And needs to be fixed.

And this is for sure. But I cannot have all the resources on a node stop and 
migrate (because of a fence) just because one of them has failed.
Those resources are not connected to one another, so it is more 
reasonable to keep the surviving resources alive, or at least to live-migrate 
them (on-fail=standby) to the other node and THEN reboot the first one.

[...]
 If a resource fails to stop, then on-fail=stop cannot possibly
 help. Furthermore, you basically make this resource less
 available (the cluster won't try to recover it). Must be that
 I'm missing something.
 At any rate, I don't think that you need to fiddle with the
 on-fail attribute, but see what's wrong with the RA or libvirt
 or the combination of the two.
 Thanks,
 Dejan

Then my question is: why was the on-fail attribute created? I repeat, I 
totally agree that there are some problems with the RA, but until I find 
out exactly what's wrong I have to take care of all my cluster's 
resources, and so the most reasonable thing is to keep a single failed vm 
stopped.

Thanks,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Issue with VirtualDomain

2011-09-20 Thread RaSca
Il giorno Lun 19 Set 2011 21:20:12 CEST, Michael Schwartzkopff ha scritto:
 Il giorno Lun 19 Set 2011 15:17:30 CEST, Michael Schwartzkopff ha scritto:
 [...]
 What transport do you use? Which version of libvirt?
 Transport ssh, libvirt version is 0.9.2-7 (the squeeze-backports version).
 I get an error:
 migration job: unexpectedly failed
 Only if I add the above mentioned options to the migration it works.

Hi Michael, does this happen *every time* you try to migrate the 
resource? I'm facing a strange problem with the VirtualDomain RA and node 
fencing (see my post on the linux-ha ML, "Understanding why a host fence 
(was: Resource fail and node fence)") and I'm trying to understand where 
the problem is.

Have you tried using another transport, such as tcp with SASL or TLS?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Issue with VirtualDomain

2011-09-20 Thread RaSca
Il giorno Mar 20 Set 2011 11:11:58 CEST, Michael Schwartzkopff ha scritto:
[...]
 Yes. This happens every time. Also when I use the virsh command line to
 migrate the virtual machine.

So the problem is inside libvirt. But are you using the same libvirt 
version as mine (0.9.2-7)?

 Your problem. Sorry, I have never seen this. I had several other issues with
 VirtualDomain. Please see my patches to the RA:
 http://www.gossamer-threads.com/lists/linuxha/dev/74103

I saw your patch and it might be very useful. My problem is exactly on stop. 
When the error comes up, the vm is in the paused state, and that seems to be 
what makes things break.

 First tests with tcp show that it also needs the --p2p --tunnelled options. I
 did not try sasl.
 Greetings,

Very strange. I think that the problem may be somewhere else (network 
connection, collisions, physical stuff)...

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Issue with VirtualDomain

2011-09-20 Thread RaSca
Il giorno Mar 20 Set 2011 12:27:20 CEST, Michael Schwartzkopff ha scritto:
[...]
 First tests with tcp show that it also needs the --p2p --tunnelled
 options. I did not try sasl.
 Greetings,
 Very strange. I think that the problem may be somewhere else (network
 connection, collisions, physical staff)...
 No. See man virsh.
 I also have 0.9.2-7~bpo60+1.

The man page only says that p2p is peer-to-peer migration and tunnelled is 
tunnelled migration. And I agree with this :-)

From what I've found:

- tunnelled migration runs the migration in the background, so that libvirt 
is able to cancel it when a problem happens.

- peer2peer makes the source libvirtd server connect directly to the 
destination libvirtd server (peer-to-peer), setting up a secure channel.

In the first case I may agree that the parameter should help (if only 
on failures), but for p2p... I can't say.
Anyway, it doesn't explain why this version works correctly for me. If 
you want I can send you my crm and libvirt configurations so we can do 
some comparisons.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Issue with VirtualDomain

2011-09-20 Thread RaSca
Il giorno Mar 20 Set 2011 13:45:30 CEST, Michael Schwartzkopff ha scritto:
[...]
 the cluster is completely irrelevant here. the plain command
 virsh migrateguest  qemu+ssh://other_node/system
 doesn't work here. I need the options --p2p --tunnelled.
 So this is a libvirt issue, no cluster issue.
 But if there parameters are really needed, the VirtualDomain has to be
 patched.
 Greetings,

This is for sure. Anyway, this is my libvirtd.conf:

listen_tls = 0
unix_sock_group = libvirt
unix_sock_rw_perms = 0770
auth_unix_ro = none
auth_unix_rw = none

I haven't changed anything else.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-HA] Understanding why a host fence (was: Resource fail and node fence)

2011-09-20 Thread RaSca
Hi all,
I start a new thread because I've got more debug details to analyze my 
situation, and starting from the beginning might be better.

My environment is composed of two machines connected to a network and 
directly to one another. The cluster runs a lot of virtual machines, each 
one based upon a dual-primary DRBD device. The two systems are Debian 
Squeeze with backports:

kernel 2.6.39-3
drbd 8.3.10-1
corosync 1.3.0-3
pacemaker 1.0.11-1
libvirt-bin 0.9.2-7

The (dual-primary) drbd resources are declared in this way:

primitive vm-1_r0 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=20s role=Master timeout=20s \
op monitor interval=30s role=Slave timeout=20s \
op start interval=0 timeout=240s \
op stop interval=0 timeout=100s

ms vm-1_ms-r0 vm-1_r0 \
meta notify=true master-max=2 clone-max=2 interleave=true

and the virtual machines are defined like this:

primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/vm-1.xml hypervisor=qemu:///system \
                migration_transport=ssh force_stop=true \
        meta allow-migrate=true \
        op monitor interval=10s timeout=30s on-fail=restart depth=0 \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s

There are colocation and order constraints for each vm:

colocation vm-1_ON_vm-1_ms-r0 inf: vm-1 vm-1_ms-r0:Master
order vm-1_AFTER_vm-1_ms-r0 inf: vm-1_ms-r0:promote vm-1:start

And there is a location constraint for the connectivity:

location vm-1_ON_CONNECTED_NODE vm-1 \
        rule $id=vm-1_ON_CONNECTED_NODE-rule -inf: not_defined ping or ping lte 0

The problem is that every night I've scheduled a live migration of a vm, 
but if this fails, then the node gets fenced, even though the on-fail 
parameter of the vm is set to restart.
Everything starts at 23:00:

Sep 19 23:00:01 node-2 crm_resource: [8947]: info: Invoked: crm_resource 
-M -r vm-1

Two seconds later the first problem:

Sep 19 23:00:02 node-2 lrmd: [2145]: info: cancel_op: operation 
monitor[171] on ocf::VirtualDomain::vm-1_virtualdomain for client 2148, 
its parameters: hypervisor=[qemu:///system] CRM_meta_depth=[0] 
CRM_meta_timeout=[3] force_stop=[true] 
config=[/etc/libvirt/qemu/vm-1.lan.mmul.local.xml] depth=[0] 
crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] 
migration_transport=[ssh] CRM_meta_interval=[1]  cancelled

Why is this operation marked as cancelled? Anyway, about 20 seconds later, 
the operation fails with Timed Out:

Sep 19 23:00:22 node-2 crmd: [2148]: ERROR: process_lrm_event: LRM 
operation vm-1_virtualdomain_migrate_to_0 (236) Timed Out (timeout=2ms)

Force shutdown is invoked:

Sep 19 23:00:22 node-2 VirtualDomain[9256]: INFO: Issuing forced 
shutdown (destroy) request for domain vm-1.

and even though the vm appears to be destroyed (the kernel messages confirm 
that the vmnet devices were destroyed), the RA seems to ignore it:

Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: 
(vm-1_virtualdomain:stop:stderr) error: Failed to destroy domain vm-1
Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: 
(vm-1_virtualdomain:stop:stderr) error: Requested operation is not 
valid: domain is not running
Sep 19 23:00:22 node-2 crmd: [2148]: info: process_lrm_event: LRM 
operation vm-1_virtualdomain_stop_0 (call=237, rc=1, cib-update=445, 
confirmed=true) unknown error

In the meantime, on the other node, some errors are discovered:

Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: 
Migrating vm-1_virtualdomain from node-2 to node-1
Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: 
Repairing vm-1_ON_vm-1_ms-r5: vm-1_virtualdomain == vm-1_ms-r5 (100)
...
...
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing 
failed op vm-1_virtualdomain_monitor_0 on node-2: unknown exec error (-2)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing 
failed op vm-1_virtualdomain_stop_0 on node-2: unknown error (1)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: pe_fence_node: Node node-2 
will be fenced to recover from resource failure(s)

a STONITH is invoked...

Sep 19 23:00:23 node-1 stonithd: [2309]: info: client tengine [pid: 
2314] requests a STONITH operation RESET on node node-2

...with success:

Sep 19 23:00:24 node-1 stonithd: [2309]: info: Succeeded to STONITH the 
node node-2: optype=RESET. whodoit: node-1

My conclusions are:

1 - the fence has nothing to do with drbd (there is no mention of it 
until the reset is done);

2 - for some reason, live migrating the vms SOMETIMES fails, even though once 
the system has recovered I can do a crm resource move vm-1 without ANY problem.

3 - even if the vm fails to stop, the cluster does not try to restart it, 
but simply fences the node, and this is not what the on-fail parameter is 
meant to do.

Does anyone have suggestions on how to debug this problem further? 
Please help!

Thanks a lot,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!

Re: [Linux-HA] Resource fail and node fence

2011-09-20 Thread RaSca
Il giorno Mar 20 Set 2011 12:53:40 CEST, Dejan Muhamedagic ha scritto:
[...]
 on-fail is a per-operation attribute. By default, it is set to
 fence only for the stop operation. The point is that a failed
 stop means that the cluster cannot establish the state of the
 resource anymore, so the only remedy remaining is to fence the
 node.
 Thanks,
 Dejan

And I completely agree with this, but in an environment like mine, where 
a single resource failure might drag down all the others (with fencing), it 
is wrong to keep this kind of setting. Do you agree with me?
It makes more sense to have the resource stopped or unmanaged instead of 
fencing the node.
So, even if I have to patch the VirtualDomain RA with Michael's one 
(http://www.gossamer-threads.com/lists/linuxha/dev/74103), it may be 
useful for me to set the stop operation's on-fail parameter to stop, or 
maybe block, like this:

op monitor interval=10 timeout=30 start-delay=0 on-fail=stop

Am I right?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Issue with VirtualDomain

2011-09-19 Thread RaSca
Il giorno Lun 19 Set 2011 12:17:30 CEST, Michael Schwartzkopff ha scritto:
 Hi,
 I tested migration with a recent version of libvirt. Essentially it was from
 squeeze-backports. The transport qemu+ssh://other_node/system now needs the
 additional parameters
 --p2p --tunnelled
 in the migration command line. So it should be:
 virsh migrateguest  --p2p --tunnelled qemu+ssh://other_node/system
 I will try to write a patch and publish it here.
 Greetings,

Hi Michael,
I'm using Debian Squeeze with Pacemaker and libvirt from backports, but 
I didn't have to apply any patch to make live migration work.

Have you got a specific setup?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] Issue with VirtualDomain

2011-09-19 Thread RaSca
Il giorno Lun 19 Set 2011 15:17:30 CEST, Michael Schwartzkopff ha scritto:
[...]
 What transport do you use? Which version of libvirt?

Transport ssh, libvirt version is 0.9.2-7 (the squeeze-backports version).

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Resource fail and node fence

2011-09-19 Thread RaSca
Il giorno Lun 19 Set 2011 17:55:58 CEST, Dejan Muhamedagic ha scritto:
 Hi,
 On Wed, Sep 14, 2011 at 09:43:43AM +0200, RaSca wrote:
 Hi all,
 I've got a two node pacemaker/corosync cluster with some virtual domain
 resources on some DRBD devices.
 Every DRBD device is configured in dual primary setup and I have enabled
 the live migration. Cluster has also stonith enabled.
 My problem is that if a live migration for a single virtualdomain
 resource fails, then this node gets fenced, making unavailable also all
 AFAIK, failing migration shouldn't result in node fence. I guess
 that actually the subsequent stop operation failed, right? In
 that case, that's probably a bug somewhere in the RA or VM code.
 Thanks,
 Dejan

Hi Dejan,
thanks as usual for your response. In the end, since I was facing too 
many unexplainable problems, I decided to upgrade libvirt and the kernel 
itself to a newer version (from squeeze to squeeze-backports). 
Until now the problems seem to be resolved.

In Pacemaker Explained (Andrew, I'm almost finished with the 
translation, I swear!) it is written that the default on-fail action is 
fence, so the assumption is that if a single resource fails, then the 
entire node is fenced. Note that at the moment every one of my 
VirtualDomain resources has the on-fail action set to restart, and I've 
not seen any fencing.
But please, help me to understand this: what do you mean by 
"subsequent stop operation"? It is very plausible that this was the 
reason, since the failed virtual machines were in the paused state even 
though I was forcing the stop. Is this enough to make a node fence? Why is 
this failure not covered by the on-fail parameter declaration?

Did I make myself clear?

Thanks a lot,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Resource fail and node fence

2011-09-14 Thread RaSca
Hi all,
I've got a two-node pacemaker/corosync cluster with some virtual domain 
resources on some DRBD devices.
Every DRBD device is configured in a dual-primary setup and I have enabled 
live migration. The cluster also has STONITH enabled.

My problem is that if a live migration for a single virtualdomain 
resource fails, then this node gets fenced, making all the other virtual 
machines unavailable as well (they get restarted on the other node after 
a poweroff).

From what I saw, the way to keep a single resource failure from fencing the 
node where it fails is to declare an on-fail=restart option for the virtual 
domain. Is this the correct approach, or is there a more elegant way to 
obtain what I want?
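
Just to make concrete what I mean (an illustrative fragment, not my exact 
configuration):

primitive vm_guest1 ocf:heartbeat:VirtualDomain \
        params config=/etc/libvirt/qemu/guest1.xml hypervisor=qemu:///system \
                migration_transport=ssh \
        meta allow-migrate=true \
        op monitor interval=10s timeout=30s on-fail=restart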

Thanks to all,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Linux HA] Problem with Apache

2011-06-08 Thread RaSca
Il giorno Mer 08 Giu 2011 11:26:01 CET, Alfredo Parisi ha scritto:
 Hi and thanks for the response.
 if I open my browser with my local ip, I see that the web server doesn't
 works.
 Which log do you want?
 apache2 or corosync? Sorry but I'm a newbie on Linux HA. Thanks
 UPDATE:
 So i've checked again and in the first time I've removed apache at the boot,
 but if I start the service apache, it works in the both nodes, with the both
 local IP and with the Virtual Ip
 So in which mode can I install Drupal on my Cluster?

Whoa. There are a lot of things you need to clarify for yourself. First 
of all, forget about the CMS for now and concentrate on putting the Apache 
service into the cluster. As I said, you need to check why Apache does not 
work in the cluster configuration, and the reason is in the logs.
After that, installing the CMS will be trivial.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Linux HA] Problem with Apache

2011-06-08 Thread RaSca
Il giorno Mer 08 Giu 2011 11:53:54 CET, Alfredo Parisi ha scritto:
 Ok thank you.
 Now apache works on the virtual ip (10.10.7.100).
 With crm_mon I haven't errors, this is my situation:
[...]
 Now can I install the CMS? Thanks

I hope so :-)

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] What means this type of errors ?

2011-06-07 Thread RaSca
Il giorno Mar 07 Giu 2011 11:16:56 CET, alain.mou...@bull.net ha scritto:
 Sorry , some mistakes in preivous logs, here are the real ones :
[...]

You need to look for your problems BEFORE these logs. The problem is 
with a Filesystem, so you need to search for errors concerning this 
resource by looking for its name BEFORE all those messages that show 
how it has been sent away from a node.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent

2011-05-27 Thread RaSca
Il giorno Gio 26 Mag 2011 23:01:48 CET, Lars Ellenberg ha scritto:
[...]
 Would that not be a power on?
 alright, seems to just be change boot settings,
 not boot per se.
 Oh well...

Yes, as you saw, it is not some kind of boot mode, but just a setting. 
Note, in addition, that there's also a wake-on-lan function that may be 
used, but there is no power off function, so some kind of power on 
would not be useful.

Thanks for all your suggestions.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent

2011-05-27 Thread RaSca
Il giorno Ven 27 Mag 2011 14:41:32 CET, Dejan Muhamedagic ha scritto:
 Hi,
[...]
 # Can't really be implemented because Hetzner webservice cannot power 
 on a system
 Replace comments with ha_log.sh calls.

Hi Dej,
everything else is clear and I've already updated the script, but what 
do you mean by this? What's the matter with the comments? The only way 
I found to call ha_log.sh is with err, warn, info or debug.

How can comments be part of this?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent

2011-05-27 Thread RaSca
Il giorno Ven 27 Mag 2011 15:02:11 CET, RaSca ha scritto:
 Il giorno Ven 27 Mag 2011 14:41:32 CET, Dejan Muhamedagic ha scritto:
 Hi,
 [...]
# Can't really be implemented because Hetzner webservice cannot power 
 on a system
 Replace comments with ha_log.sh calls.
 Hi Dej,
 everything else it's clear and I've already updated the script, but what
 do you mean with this? What's the matter with the comments? The only way
 I found to call ha_log.sh is with err, warn, info or debug.
 How comments can be part of this?

Wait, I think I understand, do you mean something like:

ha_log.sh warn "Can't really be implemented because Hetzner webservice cannot power on a system"

Am I right?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent

2011-05-27 Thread RaSca

Il giorno Ven 27 Mag 2011 15:33:24 CET, Dejan Muhamedagic ha scritto:
[...]

Yes. It's because users cannot read comments that easily, they
usually look at the logs. If at all.
Cheers,


New (and hopefully last) version attached.

Bye,

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#

# Read parameters from config file, format is based upon the hetzner OCF resource agent
# developed by Kumina: http://blog.kumina.nl/2011/02/hetzner-failover-ip-ocf-script/
conf_file="/etc/hetzner.cfg"
user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`
pass=`sed -n 's/^pass.*=\ *//p' /etc/hetzner.cfg`
hetzner_server="https://robot-ws.your-server.de"

check_http_response() {
 # If the response is 200 then return 0
 if [ "$1" = "200" ]
  then
   return 0
  else
   # If the response is not 200 then display a description of the problem and return 1
   case $1 in
400) ha_log.sh err "INVALID_INPUT - Invalid input parameters"
   ;;
404) ha_log.sh err "SERVER_NOT_FOUND - Server with ip $remote_ip not found"
   ;;
409) ha_log.sh err "RESET_MANUAL_ACTIVE - There is already a running manual reset"
   ;;
500) ha_log.sh err "RESET_FAILED - Resetting failed due to an internal error"
   ;;
   esac
   return 1
 fi
}

case $1 in
gethosts)
echo $hostname
exit 0
;;
on)
# Can't really be implemented because Hetzner's webservice cannot power on a system
ha_log.sh err "Power on is not available since Hetzner's webservice can't do this operation."
exit 1
;;
off)
# Can't really be implemented because Hetzner's webservice cannot power on a system
ha_log.sh err "Power off is not available since Hetzner's webservice can't do this operation."
exit 1
;;
reset)
# Launching the reset action via webservice
check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw)
exit $?
;;
status)
# Check if we can contact the webservice
check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/server/$remote_ip)
exit $?
;;
getconfignames)
echo hostname
echo remote_ip
exit 0
;;
getinfo-devid)
echo "Hetzner STONITH device"
exit 0
;;
getinfo-devname)
echo "Hetzner STONITH external device"
exit 0
;;
getinfo-devdescr)
echo "Hetzner host reset"
echo "Manages the remote webservice for reset a remote server."
exit 0
;;
getinfo-devurl)
echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
exit 0
;;
getinfo-xml)
cat << HETZNERXML
<parameters>
<parameter name="hostname" unique="1">
<content type="string" />
<shortdesc lang="en">
Hostname
</shortdesc>
<longdesc lang="en">
The name of the host to be managed by this STONITH device.
</longdesc>
</parameter>

<parameter name="remote_ip" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">
Remote IP
</shortdesc>
<longdesc lang="en">
The address of the remote IP that manages this server.
</longdesc>
</parameter>
</parameters>
HETZNERXML
exit 0
;;
*)
ha_log.sh err "Don't know what to do for '$remote_ip'"
exit 1
;;
esac
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent

2011-05-26 Thread RaSca

Il giorno Mar 24 Mag 2011 16:10:17 CET, Lars Ellenberg ha scritto:
[...]

exit $((! $?))
That is going to invert the code.

the shell has ! for that.
! is_host_up
Coffee?
  ;-)


Ok, following your suggestions I've modified (and tested, of course) the 
script, compacting it as much as I could. But Lars, sorry, I didn't find 
out how to compact this:


 is_host_up $remote_ip
 exit $((! $?))

into this:

exit ! is_host_up $remote_ip

:-(
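
For the record, the closest thing I found is using "!" on its own line 
(in POSIX sh it inverts the exit status of the command that follows) and 
then returning $?. Just a sketch, probably not what Lars meant:

! is_host_up $remote_ip
exit $?

but that is still two lines.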

Anyway, now I've got a deeper problem: I was totally misunderstanding 
what the status field of the curl query was meant for. So I corrected 
the is_host_up function, making it check (similar to the ssh stonith 
agent) via nc whether the ssh port is responding:


is_host_up() {
  /bin/nc -w 1 -z $1 22 > /dev/null 2>&1
  return $?
}

This sounds quite weird to me, but I can't do ping, and I can't control 
the state of the machine otherwise. I can only force the reset and then 
check if the machine is up. If you have any suggestions on how to make 
things better, don't hesitate...


Above all, it works. I set up a variable which is the timeout to wait 
before checking the machine status. It could become a parameter.


The new version is attached.

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#

# Read parameters
conf_file="/etc/hetzner.cfg"
user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`
pass=`sed -n 's/^pass.*=\ *//p' /etc/hetzner.cfg`
hetzner_server="https://robot-ws.your-server.de"
wait_timeout=15

is_host_up() {
 /bin/nc -w 1 -z $1 22 > /dev/null 2>&1
 return $?
}

case $1 in
gethosts)
echo $hostname
exit 0
;;
on)
# Can't really be implemented because Hetzner webservice cannot power on a system
exit 1
;;
off)
# Can't really be implemented because Hetzner webservice cannot power on a system
exit 1
;;
reset)
curl -s -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw > /dev/null 2>&1
sleep $wait_timeout
is_host_up $remote_ip
exit $((! $?))
;;
status)
is_host_up $remote_ip
exit $?
;;
getconfignames)
echo hostname
exit 0
;;
getinfo-devid)
echo "Hetzner STONITH device"
exit 0
;;
getinfo-devname)
echo "Hetzner STONITH external device"
exit 0
;;
getinfo-devdescr)
echo "Hetzner host reset"
echo "Manages the remote webservice for reset a remote server."
exit 0
;;
getinfo-devurl)
echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
exit 0
;;
getinfo-xml)
cat << HETZNERXML
<parameters>
<parameter name="hostname" unique="1">
<content type="string" />
<shortdesc lang="en">
Hostname
</shortdesc>
<longdesc lang="en">
The name of the host to be managed by this STONITH device.
</longdesc>
</parameter>

<parameter name="remote_ip" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">
Remote IP
</shortdesc>
<longdesc lang="en">
The address of the remote IP that manages this server.
</longdesc>
</parameter>
</parameters>
HETZNERXML
exit 0
;;
*)
exit 1
;;
esac

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-ha-dev] [Linux-HA] Hetzner server stonith agent

2011-05-26 Thread RaSca

Il giorno Gio 26 Mag 2011 11:13:46 CET, RaSca ha scritto:
[...]

The new version is attached.


Hi all,
After talking with Dejan on IRC, here is the new version of the agent.
Major changes:

- The script no longer relies on SSH for checking that the device was 
correctly fenced; instead it checks the http response code from the webservice;


- The status action looks for a 200 response from the webservice in 
GET mode;


- In case of problems, the return code of the RA is 1 and I also added a 
description of the problem (check_http_response function);


That's all, last but not least, it works.

To make things perfect I just ask Lars (following the discussion from 
two days ago) if there's a way to compact this:


check_http_response $(curl --silent -o /dev/null -w 
'%{http_code}' -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw)

exit $?

to a one line statement.
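
The only thing I came up with myself is simply joining the two commands 
with a semicolon, which at least fits on one line:

check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw); exit $?

but maybe there is something more elegant.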

Thanks everybody for the help,

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#

# Read parameters from config file, format is based upon the hetzner OCF resource agent
# developed by Kumina: http://blog.kumina.nl/2011/02/hetzner-failover-ip-ocf-script/
conf_file="/etc/hetzner.cfg"
user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`
pass=`sed -n 's/^pass.*=\ *//p' /etc/hetzner.cfg`
hetzner_server="https://robot-ws.your-server.de"

check_http_response() {
 # If the response is 200 then return 0
 if [ "$1" = "200" ]
  then
   return 0
  else
   # If the response is not 200 then display a description of the problem and return 1
   case $1 in
400) echo "INVALID_INPUT - Invalid input parameters"
   ;;
404) echo "SERVER_NOT_FOUND - Server with ip $remote_ip not found"
   ;;
409) echo "RESET_MANUAL_ACTIVE - There is already a running manual reset"
   ;;
500) echo "RESET_FAILED - Resetting failed due to an internal error"
   ;;
   esac
   return 1
 fi
}

case $1 in
gethosts)
echo $hostname
exit 0
;;
on)
# Can't really be implemented because Hetzner webservice cannot power on a system
exit 1
;;
off)
# Can't really be implemented because Hetzner webservice cannot power on a system
exit 1
;;
reset)
# Launching the reset action via webservice
check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw)
exit $?
;;
status)
# Check if we can contact the webservice
check_http_response $(curl --silent -o /dev/null -w '%{http_code}' -u $user:$pass $hetzner_server/server/$remote_ip)
exit $?
;;
getconfignames)
echo hostname
echo remote_ip
exit 0
;;
getinfo-devid)
echo "Hetzner STONITH device"
exit 0
;;
getinfo-devname)
echo "Hetzner STONITH external device"
exit 0
;;
getinfo-devdescr)
echo "Hetzner host reset"
echo "Manages the remote webservice for reset a remote server."
exit 0
;;
getinfo-devurl)
echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
exit 0
;;
getinfo-xml)
cat << HETZNERXML
<parameters>
<parameter name="hostname" unique="1">
<content type="string" />
<shortdesc lang="en">
Hostname
</shortdesc>
<longdesc lang="en">
The name of the host to be managed by this STONITH device.
</longdesc>
</parameter>

<parameter name="remote_ip" unique="1" required="1">
<content type="string" />
<shortdesc lang="en">
Remote IP
</shortdesc>
<longdesc lang="en">
The address of the remote IP that manages this server.
</longdesc>
</parameter>
</parameters>
HETZNERXML
exit 0
;;
*)
exit 1
;;
esac

Re: [Linux-HA] Colocation of VIP and httpd

2011-05-24 Thread RaSca
Il giorno Gio 19 Mag 2011 19:25:54 CET, 吴鸿宇 ha scritto:
 Hi All,
 I have a 2 node cluster. My intention is ensuring the VIP is always on the
 node that has httpd running, i.e. if service httpd on the VIP node is
 stopped and fails to start, the VIP should switch to the other node.
 With the configuration below, I observed that when httpd stops and fails to
 start, the VIP is stopped also but is not switched to the other node that
 has healthy httpd. I appreciate any ideas.
[...]

Some questions:
Why is httpd cloned? Are you sure you want an INFINITY stickiness? Do 
the logs say anything helpful?

Anyway, like Nikita said, consider upgrading Heartbeat to version 3.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-HA] Hetzner server stonith agent

2011-05-24 Thread RaSca

Hi all,
as some of you saw in the last two weeks I've faced some problems in 
configuring a Corosync/Pacemaker cluster on two Hetzner servers.


The main problem about those cheap and very powerful servers is their 
network management. For example, if you want to have a failover IP you 
need to manage it by the web interface or via a webservice, there's no 
other way.
Luckily, the guys from Kumina 
(http://blog.kumina.nl/2011/02/hetzner-failover-ip-ocf-script/) wrote an 
ocf resource agent that automates the management of the IP so the last 
(but not least) problem was the Stonith.


In Hetzner's intention, the only way you have to force a reset of 
the machine is... via the same webservice. I know, it's odd, but also in 
this case it is the only way. So, following these directions: 
http://wiki.hetzner.de/index.php/Robot_Webservice_en I wrote the stonith 
agent that is attached to this email.

It is based upon the same configuration file as Kumina's ocf agent:

# cat /etc/hetzner.cfg
[dummy]
user = username
pass = password
local_ip = local ip address of the server

And it needs two parameters: the hostname and its related 
remote_ip, for example:


primitive stonith_hserver-1 stonith:external/hetzner \
params hostname=hserver-1 remote_ip=X.Y.Z.G \
op start interval=0 timeout=60s

First of all, it works. The system is able to fence nodes both in case of 
split brain and when triggered manually, so I can say it is ok. But it is 
the first stonith agent that I wrote, so it may need some corrections.


Hope this can help someone. Thanks to andreask who helped me on irc in 
understanding how stonith agents work.


--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
#!/bin/sh
#
# External STONITH module for Hetzner.
#
# Copyright (c) 2011 MMUL S.a.S. - Raoul Scarazzini ra...@mmul.it
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#

# Read parameters
conf_file="/etc/hetzner.cfg"
user=`cat /etc/hetzner.cfg | egrep "^user.*=" | sed 's/^user.*=\ *//g'`
pass=`cat /etc/hetzner.cfg | egrep "^pass.*=" | sed 's/^pass.*=\ *//g'`
hetzner_server="https://robot-ws.your-server.de"

is_host_up() {
  if [ "$1" != "" ]
   then
    status=`curl -s -u $user:$pass $hetzner_server/server/$1 | sed 's/.*"status"\:"\([A-Za-z]*\)",.*/\1/g'`
    if [ "$status" = "ready" ]
     then
      return 0
     else
      return 1
    fi
   else
    return 1
  fi
}

case $1 in
gethosts)
echo $hostname
exit 0
;;
on)
# Can't really be implemented because Hetzner webservice cannot power on a system
exit 1
;;
off)
# Can't really be implemented because Hetzner webservice cannot power on a system
exit 1
;;
reset)
status=`curl -s -u $user:$pass $hetzner_server/reset/$remote_ip -d type=hw`
if [ "$status" = "" ]
 then
  exit 1
 else
  if is_host_up $hostaddress
   then
    exit 1
   else
    exit 0
  fi
fi
exit 1
;;
status)
if [ "$remote_ip" != "" ]
 then
  if is_host_up $remote_ip
   then
    exit 0
   else
    exit 1
  fi
 else
  # Check if we can contact the server
  status=`curl -s -u $user:$pass $hetzner_server/server/`
  if [ "$status" = "" ]
   then
    exit 1
   else
    exit 0
  fi
fi
;;
getconfignames)
echo hostname
exit 0
;;
getinfo-devid)
echo "Hetzner STONITH device"
exit 0
;;
getinfo-devname)
echo "Hetzner STONITH external device"
exit 0
;;
getinfo-devdescr)
echo "Hetzner host reset"
echo "Manages the remote webservice for reset a remote server."
exit 0
;;
getinfo-devurl)
echo "http://wiki.hetzner.de/index.php/Robot_Webservice_en"
exit 0
;;
getinfo-xml)
cat

Re: [Linux-HA] Hetzner server stonith agent

2011-05-24 Thread RaSca
Il giorno Mar 24 Mag 2011 12:27:04 CET, Dejan Muhamedagic ha scritto:
 Hi,

Hi Dejan,

[...]
 # Read parameters
 conf_file=/etc/hetzner.cfg
 user=`cat /etc/hetzner.cfg | egrep ^user.*= | sed 's/^user.*=\ *//g'`
 Better:
 user=`sed -n 's/^user.*=\ *//p' /etc/hetzner.cfg`

Absolutely agree.

 pass=`cat /etc/hetzner.cfg | egrep ^pass.*= | sed 's/^pass.*=\ *//g'`
 hetzner_server=https://robot-ws.your-server.de;
 I assume that this is a well-known URL which doesn't need to be
 passed as a parameter.

As far as I know it is the only address; I hard-coded it for this 
reason, but maybe it should be a parameter...

 is_host_up() {
if [ $1 !=  ]
 then
  status=`curl -s -u $user:$pass $hetzner_server/server/$1 | sed 
 's/.*status\:\([A-Za-z]*\),.*/\1/g'`
  if [ $status = ready ]
   then
return 0
   else
return 1
  fi
 This if statement can be reduced to (you save 5 lines):
   [ $status = ready ]
 else
  return 1
fi
 }

You mean the statement should be:

[ "$status" = "ready" ] && return 0
return 1

?

[...]
 Again, better (is return code of is_host_up inverted?):
   is_host_up $hostaddress
exit # this is actually also superfluous, but perhaps better left in

The action is reset, so if it succeeded then is_host_up must be NOT 
ready. Or not?

[...]
 Ditto.
 Good work!
 Cheers,
 Dejan
 P.S. Moving discussion to linux-ha-dev.

If the compact way is correct, I can modify the script and post it again.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Hetzner server stonith agent

2011-05-24 Thread RaSca
Il giorno Mar 24 Mag 2011 12:44:42 CET, RaSca ha scritto:
 Il giorno Mar 24 Mag 2011 12:27:04 CET, Dejan Muhamedagic ha scritto:
[...]
 P.S. Moving discussion to linux-ha-dev.
[...]

Sorry... I removed the wrong address and posted again on linux-ha :-(

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Best way for colocating resource on a dual primary drbd

2011-05-16 Thread RaSca
Il giorno Lun 16 Mag 2011 09:01:08 CET, Andrew Beekhof ha scritto:
[...]
 Implicit that once the resource go away it becomes slave?
 Pretty sure this is a bug in 1.0.
 Have you tried 1.1.5 ?

Not yet. But Andrew, are you saying that keeping the colocation, even 
if I have a dual-primary drbd, is the best thing to do?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Event: an Heartbeat/Corosync/DRBD/Pacemaker free seminar in Rho (Milan, Italy), on June 24 2011

2011-05-16 Thread RaSca
Hi all,
I hope that this message is not too off-topic, but I want to present to 
you a seminar that will take place in Rho (MI), Italy, on June 24 2011.

The title is "Evoluzione dell'alta affidabilità su Linux" (the evolution 
of high availability on Linux); it will be a one-day seminar focused on 
the evolution of Linux clustering, from Heartbeat to Pacemaker, passing 
through DRBD and Corosync.

There will be also a lab part with the creation of an active-active NFS 
server based on LVM and DRBD. All the project is made by MMUL 
(http://www.mmul.it) in collaboration with Linbit 
(http://www.linbit.com, thanks to Florian).

I know that most of you are not Italian, but it may be a good reason 
to come and see the Bel Paese.

Here is all the information about the event: 
http://www.miamammausalinux.org/2011/05/i-seminari-di-mia-mamma-usa-linux-evoluzione-dellalta-affidabilita-su-linux/

It will be totally free, with a brief lunch included, everything offered 
by MMUL.

Let me say that without your every-day help, this event would have never 
been possible.

Have a nice day,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Best way for colocating resource on a dual primary drbd

2011-05-14 Thread RaSca
Il giorno Ven 13 Mag 2011 16:09:14 CET, Viacheslav Biriukov ha scritto:
 In your case you have two drbd master. So, I think, it is not a good
 idea to create that collocation. Instead of this you can set location
 directives to locate vm-test_virtualdomain where you want to be default.
 For example:
 location L_vm-test_virtualdomain_01 vm-test_virtualdomain 100: master1.node
 location L_vm-test_virtualdomain_02 vm-test_virtualdomain 10:   master2.node

And I agree with your point of view (since I tested that the colocation is 
not working). But the point is: why? I mean, the colocation defines that 
the virtual domain must run on a node where drbd is Master. Why does 
Pacemaker demote drbd to slave on the node from which the migration 
starts? Does a colocation like this:

colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf:
vm-test_virtualdomain vm-test_ms-r0:Master

imply that once the resource goes away it becomes slave?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Best way for colocating resource on a dual primary drbd

2011-05-13 Thread RaSca
Hi all,
I've got a setup with a dual primary DRBD with over it a KVM virtual 
machine, managed by a Virtualdomain resource.
In a classical primary-seconday setup, the declaration of the resource is:

primitive vm-test_r0 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=20s timeout=40s \
op start interval=0 timeout=240s \
op stop interval=0 timeout=100s

ms vm-test_ms-r0 vm-test_r5 \
meta master-max=1 notify=true

primitive vm-test_virtualdomain ocf:heartbeat:VirtualDomain \
params config=/etc/libvirt/qemu/vm-test.xml 
hypervisor=qemu:///system \
meta allow-migrate=false \
op monitor interval=10 timeout=30 depth=0 \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s

colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: 
vm-test_virtualdomain vm-test_ms-r0:Master

order vm-test_virtualdomain_AFTER_vm-test_ms-r0 inf: 
vm-test_ms-r0:promote vm-test_virtualdomain:start

And it's perfectly clear that with this setup I cannot have live 
migration, so, to change this, I made these modifications, following 
what's written in the dual-primary drbd documentation on clusterlabs:

primitive vm-test_r0 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=20 role=Master timeout=20 \
op monitor interval=30 role=Slave timeout=20 \
op start interval=0 timeout=240s \
op stop interval=0 timeout=100s

ms vm-test_ms-r0 vm-test_r0 \
meta notify=true master-max=2 interleave=true

primitive vm-test_virtualdomain ocf:heartbeat:VirtualDomain \
params config=/etc/libvirt/qemu/vm-test.xml 
hypervisor=qemu:///system migration_transport=ssh \
meta allow-migrate=true \
op monitor interval=10 timeout=30 depth=0 \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s

order vm-test_virtualdomain_AFTER_vm-test_ms-r0 inf: 
vm-test_ms-r0:promote vm-test_virtualdomain:start

And here are my doubts, because without a colocation the vm migrates 
fine, but with a classical colocation like this:

colocation vm-test_virtualdomain_ON_vm-test_ms-r0 inf: 
vm-test_virtualdomain vm-test_ms-r0:Master

things break. For some reason (that I don't understand) drbd on the 
destination node gets demoted to secondary.

So, is it correct not to declare a colocation, or is there a better way 
to do what I'm doing?

Thanks a lot!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd does not react as expected = split brain

2011-04-27 Thread RaSca
Il giorno Mer 27 Apr 2011 11:11:44 CET, Stallmann, Andreas ha scritto:
 Hi!
 I've two cluster-nodes, both running pingd (as a clone), to keep ressources 
 from starting on nodes which have not obvious connection to the network. The 
 ping-nodes are:
[...]
 Any ideas?
 Thanks for your help,

As far as I remember the master suggestion was to use ping instead of 
pingd, so... Try ping.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pingd does not react as expected = split brain

2011-04-27 Thread RaSca
Il giorno Mer 27 Apr 2011 12:04:57 CET, Stallmann, Andreas ha scritto:
 Hi!
 -Ursprüngliche Nachricht-
 Von: RaSca [mailto:ra...@miamammausalinux.org]
 Gesendet: Mittwoch, 27. April 2011 11:28
 As far as I remember the master suggestion was to use ping instead of pingd, 
 so... Try ping.
 Allready using ping. See for yourself:
 primitive pingy_res ocf:pacemaker:ping ...
 Any other suggestions?
 TNX,
 A.

Sorry, I didn't look at the configuration. My suggestion is (as Lars 
already said) to use the fence options of drbd. I've got a setup like 
yours, with two machines in two different places and after split brain 
situations I've never had corrupted data. I'm fencing nodes by heartbeat:

http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html
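
For reference, the drbd.conf side of it looks more or less like this (just 
a sketch from memory, the handler paths may differ on your distribution):

resource r0 {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}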

Good luck!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problem with an active/active NFS setup with exportfs RA

2011-04-15 Thread RaSca
Il giorno Ven 15 Apr 2011 19:33:18 CET, Alessandro Iurlano ha scritto:
 For the records, thanks to the guys in @linux-ha now I have got a
 working configuration.
 The missing point was that OCFS2 and DLM were blocked waiting for a
 stonith call for the other node to end. In my configuration I had
 stonith disabled, but this seems not to affect OCFS2 and DLM.
 So the solution was to enable stonith with a plugin (I tried both
 meatware and null plugin for testing) and now the cluster seems to
 behave correctly (as of the first tests).
 Thanks!
 Alessandro

Great, Alessandro!
I'm sorry for being so late in answering to the thread, I've been a 
little busy. I'm happy to read that @linux-ha is still the best place 
for making doubts fly away.

Have a nice day!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problem with an active/active NFS setup with exportfs RA

2011-04-04 Thread RaSca
Il giorno Sab 02 Apr 2011 19:04:08 CET, Alessandro Iurlano ha scritto:
 On Fri, Apr 1, 2011 at 11:34 AM, RaScara...@miamammausalinux.org  wrote:
 Then I tried to find a way to keep just the rmtab file synchronized on
 both nodes. I cannot find a way to have pacemaker do this for me. Is
 there one?
 As far as I know, all those operations are handled by the exportfs RA.
 I believe this was true till the backup part was removed. See the git
 commit below.

So, for some reason this is not needed anymore, but I don't think it 
will create problems; surely the RA maintainer has done all the necessary 
tests.

 I checked the boot order and indeed I was doing it the wrong way.
 After I fixed it, a couple of tests worked right away, while the
 client hanged again when I switched back the cluster to both nodes
 online.
 Could you post your working configuration?
 Thanks,
 Alessandro

Here it is, note that I'm using DRBD instead of a shared storage 
(basically each drbd is a stand alone export that can reside 
independently on a node):

node ubuntu-nodo1
node ubuntu-nodo2
primitive drbd0 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=20s timeout=40s
primitive drbd1 ocf:linbit:drbd \
params drbd_resource=r1 \
op monitor interval=20s timeout=40s
primitive nfs-kernel-server lsb:nfs-kernel-server \
op monitor interval=10s timeout=30s
primitive ping ocf:pacemaker:ping \
params host_list=172.16.0.1 multiplier=100 name=ping \
op monitor interval=20s timeout=60s \
op start interval=0 timeout=60s
primitive portmap lsb:portmap \
op monitor interval=10s timeout=30s
primitive share-a-exportfs ocf:heartbeat:exportfs \
params directory=/share-a clientspec=172.16.0.0/24 
options=rw,async,no_subtree_check,no_root_squash fsid=1 \
op monitor interval=10s timeout=30s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=40s
primitive share-a-fs ocf:heartbeat:Filesystem \
params device=/dev/drbd0 directory=/share-a fstype=ext3 
options=noatime fast_stop=no \
op monitor interval=20s timeout=40s \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s
primitive share-a-ip ocf:heartbeat:IPaddr2 \
params ip=172.16.0.63 nic=eth0 \
op monitor interval=20s timeout=40s
primitive share-b-exportfs ocf:heartbeat:exportfs \
params directory=/share-b clientspec=172.16.0.0/24 
options=rw,no_root_squash fsid=2 \
op monitor interval=10s timeout=30s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=40s
primitive share-b-fs ocf:heartbeat:Filesystem \
params device=/dev/drbd1 directory=/share-b fstype=ext3 
options=noatime fast_stop=no \
op monitor interval=20s timeout=40s \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s
primitive share-b-ip ocf:heartbeat:IPaddr2 \
params ip=172.16.0.64 nic=eth0 \
op monitor interval=20s timeout=40s
primitive statd lsb:statd \
op monitor interval=10s timeout=30s
group nfs portmap statd nfs-kernel-server
group share-a share-a-fs share-a-exportfs share-a-ip
group share-b share-b-fs share-b-exportfs share-b-ip
ms ms_drbd0 drbd0 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true
ms ms_drbd1 drbd1 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
clone nfs_clone nfs \
meta globally-unique=false
clone ping_clone ping \
meta globally-unique=false
location share-a_on_connected_node share-a \
rule $id=share-a_on_connected_node-rule -inf: not_defined ping or 
ping lte 0
location share-b_on_connected_node share-b \
rule $id=share-b_on_connected_node-rule -inf: not_defined ping or 
ping lte 0
colocation share-a_on_ms_drbd0 inf: share-a ms_drbd0:Master
colocation share-b_on_ms_drbd1 inf: share-b ms_drbd1:Master
order share-a_after_ms_drbd0 inf: ms_drbd0:promote share-a:start
order share-b_after_ms_drbd1 inf: ms_drbd1:promote share-b:start
property $id=cib-bootstrap-options \
dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
no-quorum-policy=ignore \
stonith-enabled=false \
last-lrm-refresh=1301915944

Note that I've grouped all the nfs-server daemons (portmap, nfs-common 
and nfs-kernel-server) in the cloned group nfs_clone.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problem with an active/active NFS setup with exportfs RA

2011-04-01 Thread RaSca
Il giorno Gio 31 Mar 2011 18:42:17 CET, Alessandro Iurlano ha scritto:
 Hello.
[...]
 I have tried to put /var/lib/nfs directory on an OCFS2 filesystem
 shared by both nodes but I had a lot of stability problems with the
 nfs server processes. In particular they often seems to hang while
 starting or even stopping. I think this could be because they may be
 locking some files on the shared filesystems. As the file are kept
 locked by the daemons, further lock operation may be blocked
 indefinitely.

Having a shared /var/lib/nfs makes sense with active-standby 
configurations, because the nfs servers do not talk to each other, so 
I think it's normal to have instability in an active/active setup.

 Then I tried to find a way to keep just the rmtab file synchronized on
 both nodes. I cannot find a way to have pacemaker do this for me. Is
 there one?

As far as I know, all those operations are handled by the exportfs RA.

 Also, I have found that the exportfs RA originally had a mechanism to
 keep rmtab synchronized but it has been removed in this commit:
 https://github.com/ClusterLabs/resource-agents/commit/0edb009a87f0d47b310998f2cb3809d2775e2de8
 Is there another way to accomplish this active/active setup?

I've configured a lot of A/A setups with exportfs without having 
problems. What's the boot order of your resources? Are you sure you're 
removing the IP first of all and then exportfs? I'm asking this because 
one of my problems was about the connections kept open by the clients 
(I was removing exportfs first).

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Sort of crm commandes but off line ?

2011-03-24 Thread RaSca
Il giorno Gio 24 Mar 2011 14:32:09 CET, Alain.Moulle ha scritto:
 Hi,
 Ok I think my question was not clear : in fact, the pb is not to
 do or not sshnode  crm ... , the pb is just to know the
 hostname of the node to ssh it , in another way than parsing
 the cib.xml to know which other nodes are in the same HA cluster
 as the node where I am (knowing that corosync is stopped on this
 local node) .
 Thanks
 Regards.
 Alain
 This might sound obvious but is an ssh call acceptable?

What about:

cat /var/lib/heartbeat/crm/cib.xml | grep "<node id" | sed -n 's/.*uname=\"\(.*\)\".*/\1/p'

?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Sort of crm commandes but off line ?

2011-03-24 Thread RaSca
Il giorno Gio 24 Mar 2011 15:11:57 CET, Alain.Moulle ha scritto:
 Thanks but that's a search in cib.xml ... I'll already have a solution
 with xpath for that:
 xpath /var/lib/heartbeat/crm/cib.xml "/cib/configuration/nodes/node/@uname" 2>/dev/null
 etc.
 My question was slightly different ... but anyway, I'll parse cib.xml
 Thanks a lot.
 Regards
 Alain

Sorry, I don't think I really understood what you want to obtain, but if 
a node is not connected to the cluster, then I don't think it's possible 
to look for this kind of information anywhere else than by checking the 
xml.

What about creating a shared area (external to the cluster, like an nfs 
mount) in which the cluster (with the crm_mon resource agent) puts the 
cluster state (in html, for example)?

If a node is not connected to the cluster, then it should simply look in 
this area, parse the html (or txt, or whatever) and then understand who 
is in the cluster...

This would be real-time information, and not a (possibly) old xml file...
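
As a rough idea, something like this (just a sketch: the shared path is 
invented and the ocf:pacemaker:ClusterMon parameter names should be 
checked against its metadata):

primitive cluster_state ocf:pacemaker:ClusterMon \
params htmlfile="/mnt/shared/cluster_state.html" \
op monitor interval="60s" timeout="20s"
clone cluster_state_clone cluster_state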

Hope this helps,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover NFS using Pacemaker.

2011-03-22 Thread RaSca
Il giorno Mar 22 Mar 2011 10:32:10 CET, Christoph Bartoschek ha scritto:
[...]
 The linbit.com document Highly available NFS storage with DRBD and
 Pacemaker suggests to use the lsb:nfs-kernel-server resource while the
 wiki suggests ocf:heartbeat:nfsserver.
 Which one is the better one and what are the advantages?
 Christoph

exportfs must not be used together with nfsserver: they do the same thing 
in two different ways. The first manages the exports on an already active 
nfs server, the second manages the nfs processes (and there you must have 
a shared /var/lib/nfs between nodes).
So the choice is yours.
I personally prefer exportfs because it is a little bit simpler and all 
you have to do is configure the primitive. There is also one other big 
thing that exportfs has and nfsserver does not: suppose you want to create 
a cluster nfs server in which every node shares an export: this is very 
simple with exportfs, but basically impossible with nfsserver (unless 
you split your configuration, your nfs daemon and so on).

Have a nice day.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover NFS using Pacemaker.

2011-03-22 Thread RaSca
Il giorno Mar 22 Mar 2011 11:01:25 CET, Caspar Smit ha scritto:
 I basically had the same question as Christopher :)
 Thanks for the great clarification RaSca!
 I'll go for exportfs and lsb:nfs-kernel-server for sure.
 Kind regards,
 Caspar Smit

One is glad to be of service :-)
In my experience, I always create an nfs group composed of portmap, 
nfs-common and nfs-kernel-server (all lsb resources), cloned on every 
node. Then I create the sub-groups for the single exports, composed of the 
share (local, iscsi or whatever), exportfs and the virtual IP.
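
In crm terms it is more or less this (only a sketch with invented names, 
the corresponding primitives are of course needed too):

group nfs portmap nfs-common nfs-kernel-server
clone nfs_clone nfs meta globally-unique="false"
group share-a share-a-fs share-a-exportfs share-a-ip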

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover NFS using Pacemaker.

2011-03-22 Thread RaSca
Il giorno Mar 22 Mar 2011 12:02:18 CET, Caspar Smit ha scritto:
[...]
 Thanks for this tip, why would the linbit.com document not mention
 nfs-common and portmap? Are these only needed in specific situations or are
 they always needed(I'm using debian lenny)?

I think that this is because those are considered system services, 
something that is always active (if you want to mount nfs resources you 
need both portmap and nfs-common).

 And do cloned resources (portmap, nfs-common and nfs-kernel-server) really
 have to be in a group? (Since they are cloned and the service runs on all
 nodes anyway)
 Kind regards,
 Caspar

I made groups just to preserve a logical order: first portmap, then 
nfs-common and then nfs-kernel-server.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover NFS using Pacemaker.

2011-03-22 Thread RaSca
Il giorno Mar 22 Mar 2011 13:50:19 CET, Caspar Smit ha scritto:
 Could you maybe post (a snippet of) your crm configuration (the part
 concerning NFS)?
 This could be a great help for other users as well I think.
 Thanks in advance
 Kind regards,
 Caspar Smit

Hi Caspar,
all my experiences are inside the article I wrote for my technical 
portal miamammausalinux.org:

http://www.miamammausalinux.org/2010/09/evoluzione-dellalta-affidabilita-su-linux-realizzare-un-nas-con-pacemaker-drbd-ed-exportfs/

Here you will find everything you need; the only problem is that it is all 
written in Italian... Of course, translators are welcome ;-)

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover NFS using Pacemaker.

2011-03-22 Thread RaSca
Il giorno Mar 22 Mar 2011 15:11:34 CET, Caspar Smit ha scritto:
 Thanks for this, but I don't see any portmap, nfs-common and/or
 nfs-kernel-server lsb scripts used in that article.
 That was just the part I was interested in :)
 Kind regards
 Caspar Smit

You're absolutely right :-) Take a look at this configuration:

node ubuntu-nodo1
node ubuntu-nodo2
primitive drbd0 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=20s timeout=40s \
meta target-role=started
primitive drbd1 ocf:linbit:drbd \
params drbd_resource=r1 \
op monitor interval=20s timeout=40s
primitive nfs-kernel-server lsb:nfs-kernel-server \
op monitor interval=10s timeout=30s
primitive ping ocf:pacemaker:ping \
params host_list=172.16.0.1 multiplier=100 name=ping \
op monitor interval=20s timeout=60s \
op start interval=0 timeout=60s
primitive portmap lsb:portmap \
op monitor interval=10s timeout=30s
primitive share-a-exportfs ocf:heartbeat:exportfs \
params directory=/share-a clientspec=172.16.0.0/24 
options=rw,async,no_subtree_check,no_root_squash fsid=1 \
op monitor interval=10s timeout=30s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=40s \
meta is-managed=true target-role=started
primitive share-a-fs ocf:heartbeat:Filesystem \
params device=/dev/drbd0 directory=/share-a fstype=ext3 
options=noatime fast_stop=no \
op monitor interval=20s timeout=40s \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s \
meta is-managed=true target-role=started
primitive share-a-ip ocf:heartbeat:IPaddr2 \
params ip=172.16.0.63 nic=eth0 \
op monitor interval=20s timeout=40s \
meta is-managed=true target-role=started
primitive share-b-exportfs ocf:heartbeat:exportfs \
params directory=/share-b clientspec=172.16.0.0/24 
options=rw,no_root_squash fsid=2 \
op monitor interval=10s timeout=30s \
op start interval=0 timeout=40s \
op stop interval=0 timeout=40s \
meta target-role=started
primitive share-b-fs ocf:heartbeat:Filesystem \
params device=/dev/drbd1 directory=/share-b fstype=ext3 
options=noatime fast_stop=no \
op monitor interval=20s timeout=40s \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s \
meta target-role=started
primitive share-b-ip ocf:heartbeat:IPaddr2 \
params ip=172.16.0.64 nic=eth0 \
op monitor interval=20s timeout=40s \
meta target-role=started
primitive statd lsb:statd \
op monitor interval=10s timeout=30s
group nfs portmap statd nfs-kernel-server
group share-a share-a-fs share-a-exportfs share-a-ip
group share-b share-b-fs share-b-exportfs share-b-ip
ms ms_drbd0 drbd0 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true
ms ms_drbd1 drbd1 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
clone nfs_clone nfs \
meta globally-unique=false
clone ping_clone ping \
meta globally-unique=false
location share-a_on_connected_node share-a \
rule $id=share-a_on_connected_node-rule -inf: not_defined ping or 
ping lte 0
location share-b_on_connected_node share-b \
rule $id=share-b_on_connected_node-rule -inf: not_defined ping or 
ping lte 0
colocation share-a_on_ms_drbd0 inf: share-a ms_drbd0:Master
colocation share-b_on_ms_drbd1 inf: share-b ms_drbd1:Master
order share-a_after_ms_drbd0 inf: ms_drbd0:promote share-a:start
order share-b_after_ms_drbd1 inf: ms_drbd1:promote share-b:start
property $id=cib-bootstrap-options \
dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
no-quorum-policy=ignore \
stonith-enabled=false \
last-lrm-refresh=1300276063

I omitted the nfs server part in my article because explaining that 
part too would make the article even longer...

Bye,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover NFS using Pacemaker.

2011-03-21 Thread RaSca
Il giorno Lun 21 Mar 2011 16:27:35 CET, Caspar Smit ha scritto:
 Hi,
[...]
 In this document there is no mention of the /var/lib/nfs directory but in
 stead a new resource agent (exportfs)
 Does this exportfs resource agent deprecate the need for a shared
 /var/lib/nfs or do I still need to do that?

Hi,
no, exportfs automatically creates just two hidden files, in which the 
exportfs process pid and the rmtab information are stored.

 ps. What about the nfsserver resource agent? Will I need that too?

You will need to have an NFS server running. The exportfs RA populates the 
system export table. It's your choice whether to put the nfs server under 
the control of the cluster (maybe with a cloned resource group composed of 
portmap, nfs-common and nfs-kernel-server) or not.

Bye,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration

2011-03-11 Thread RaSca
Il giorno Ven 11 Mar 2011 07:32:32 CET, Randy Katz ha scritto:
 ps - in /var/log/messages I find this:

 Mar 10 22:31:45 drbd1 lrmd: [3274]: ERROR: get_resource_meta: pclose
 failed: Interrupted system call
 Mar 10 22:31:45 drbd1 lrmd: [3274]: WARN: on_msg_get_metadata: empty
 metadata for ocf::linbit::drbd.
 Mar 10 22:31:45 drbd1 lrmadmin: [3481]: ERROR:
 lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a reply
 message of rmetadata with function get_ret_from_msg.
[...]

Hi,
I think that the message "no such resource agent" explains what the 
matter is.
Does the file /usr/lib/ocf/resource.d/linbit/drbd exist? Is the drbd 
file executable? Have you correctly installed the drbd packages?

Check those things; you can also try to reinstall drbd.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration

2011-03-11 Thread RaSca
Il giorno Ven 11 Mar 2011 10:36:25 CET, Randy Katz ha scritto:
[...]
 Hi
 # ls -l /usr/lib/ocf/resource.d/linbit/drbd
 -rwxr-xr-x 1 root root 24523 Jun 4 2010 /usr/lib/ocf/resource.d/linbit/drbd
 DRBD is running fine, I have setup that part of it already. I am using
 the ha-scsi.pdf and up to this
 point everything is fine.
 Randy

This is a little bit strange. If you are sure about the drbd setup then 
the "no such resource agent" error should not be present. What is your 
pacemaker version? It might be a bug (on Google there are some reports of 
this kind of problem turning out to be a bug).
You can also try to run the resource agent manually, by going into 
/usr/lib/ocf/resource.d/linbit/, setting the needed environment 
variables, and seeing what happens.
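
Something along these lines (just a sketch, assuming the drbd resource is 
called r0; the OCF_RESKEY_* variables carry the resource parameters):

cd /usr/lib/ocf/resource.d/linbit
OCF_ROOT=/usr/lib/ocf OCF_RESKEY_drbd_resource=r0 ./drbd monitor
echo $?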

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] question on Creating an Active/Passive iSCSI configuration

2011-03-11 Thread RaSca
Il giorno Ven 11 Mar 2011 11:15:03 CET, RaSca ha scritto:
[...]
 This is a little bit strange. If you are sure about the drbd setup than
 the no such resource agent error should not be present. What is your
 pacemaker version? It might be a bug (on google there are some cases of
 this kind of problems that are bugs).
 You can even try to run the resource agent manually, by going into the
 /usr/lib/ocf/resource.d/linbit/ and setting the environmental variables
 needed, and see what happens.

Another thing... Have you also tried declaring the master-slave resource 
for that drbd and seeing what happens?

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Here again with my problem with iscsi resource agent

2011-01-25 Thread RaSca
Il giorno Mar 18 Gen 2011 12:06:36 CET, Dejan Muhamedagic ha scritto:
[...]
 Good. Thanks for investigating. As we've already discussed
 yesterday in irc, the iscsi patch I attached last week contained
 a bug which has been fixed now in the repository. Perhaps that
 bug caused the further problems which you experienced, so it
 doesn't actually have to do anything with iscsiadm discovery. Can
 you please test that.
[...]

Finally I've tested it. It works. So you can now choose whether or not 
to include the discovery selection patch I made.
In any case, I can confirm that the latest iscsi resource agent 
works great.

Have a great day!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problem with dependent groups at cluster startup

2011-01-19 Thread RaSca
Il giorno Mer 19 Gen 2011 15:55:30 CET, Andrew Beekhof ha scritto:
[...]
 Not sure where you are with this, but these logs indicate that some
 things are already running (rc=0) when the cluster starts:
 Jan 11 15:11:52 SE4 crmd: [32095]: info: match_graph_event: Action
 db_iscsi-lun_monitor_0 (7) confirmed on se4 (rc=0)
 Jan 11 15:11:52 SE4 crmd: [32095]: info: match_graph_event: Action
 db_iscsi-target_monitor_0 (6) confirmed on se4 (rc=0)
 This may create the impression that pacemaker started things in the
 wrong order, even though it didn't.

As discussed with Dejan, we found a solution. There's something in the 
discovery that fails, but now the RA is patched. Look at the thread 
"Here again with my problem with iscsi resource agent".

Thanks a lot for your help!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Is 'resource_set' still experimental?

2011-01-18 Thread RaSca
Il giorno Mar 18 Gen 2011 11:20:13 CET, Florian Haas ha scritto:
 On 01/04/2011 11:56 AM, Tobias Appel wrote:
 On 12/28/2010 06:46 PM, Dejan Muhamedagic wrote:
 40 order constraints? A big cluster.
 We have currently 40 VM's (XEN) on it. I can't put them in a group since
 they have to run independently and not necessarily on the same node(s).
 And setting meta ordered=false colocated=false on the group is not an
 option?
 Florian

As discussed yesterday on IRC with Andrew, there is no way of creating a 
group with independent resources.
I was hoping that setting the options you mentioned could do the trick, 
but I've just tested it:

If you declare a group like this:

group groupA resA resB resC meta ordered=false colocated=false

and then you do a:

crm resource stop resB, then resC is also stopped. So the only way to 
make this setup work is to declare an order+colocation set with parentheses.
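Something like this, just as a sketch (here "storage" only stands for 
whatever the resources actually depend on in your setup):

colocation col_independent inf: ( resA resB resC ) storage
order ord_independent inf: storage:start ( resA:start resB:start resC:start )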

Please correct me if I'm wrong or if I've misunderstood what you wrote.

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat moves the resources when heartbeat starts on a second node

2011-01-18 Thread RaSca
Il giorno Mar 18 Gen 2011 12:13:15 CET, Erik Dobák ha scritto:
 Hi people,
 i got my active/passive cluster running. when i start the first node all
 resources are started. but when i start the second node, all resources are
 stoped on the first node and started on the second node. why? do i something
 wrong?
[...]
 cheers
 E

You have to take a look at resource stickiness and placement; it's all 
covered in the Configuration Explained docs.
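If you are on Pacemaker, a minimal sketch to keep resources where they are 
once started (the value 100 is only an example) is:

crm configure rsc_defaults resource-stickiness=100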

Bye,

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] no quorum problem

2011-01-18 Thread RaSca
Il giorno Mar 18 Gen 2011 16:11:13 CET, Pavlos Polianidis ha scritto:
 Dear Andrew
 So is there any solution to make the quorum operate?
 Thanks in advance
 Pavlos Polianidis

How is your no-quorum-policy set?
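In a two-node cluster, for example, you usually have to relax it; a minimal 
sketch (check that it fits your setup before using it):

crm configure property no-quorum-policy=ignore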

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Here again with my problem with iscsi resource agent

2011-01-16 Thread RaSca

Il giorno Ven 14 Gen 2011 17:34:10 CET, RaSca ha scritto:
[...]

I can say for sure that we will surely know. As you can see, in all of
my posts (and projects) I never give up until there is a clear solution
to the problem. I will find out also in this case.
Thanks again,


Dejan, here I am.
Attached to this mail you can find a patch to your original iscsi RA, in 
which I've added the possibility to perform or skip the discovery 
(discovery_enable parameter). In this way the RA works perfectly and I 
can use it in my environment.
As I first supposed, there's something in the discovery code that 
makes things break.
Note that I will continue my investigation on the discovery problem, but 
in this way it's possible for me to use the original RA with my patch 
and, of course, the parameter discovery_enable set to no.
Maybe, since this is an optional parameter, it can be included in the 
official RA.
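Just to give an idea, with the patch applied the resource can be declared 
roughly like this (portal and target here are only example values taken 
from my test setup):

primitive p_iscsi ocf:heartbeat:iscsi \
        params portal="10.0.0.100:3260" \
        target="iqn.2010-12.local.rascanet:db.rascanet.iscsi" \
        discovery_enable="no" \
        op monitor interval="30s" timeout="60s"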


Hoping that you're not unhappy anymore ;-)

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

*** heartbeat/iscsi	2011-01-16 13:26:12.0 +0100
--- rasca/iscsi	2011-01-16 13:27:57.0 +0100
***************
*** 31,36 ****
--- 31,37 ----
  #	OCF_RESKEY_portal: the iSCSI portal address or host name (required)
  #	OCF_RESKEY_target: the iSCSI target (required)
  #	OCF_RESKEY_iscsiadm: iscsiadm program path (optional)
+ #	OCF_RESKEY_discovery_enable: enable discovery? (default: yes)
  #	OCF_RESKEY_discovery_type: discovery type (optional; default: sendtargets)
  #
  # Initialization:
***************
*** 87,92 ****
--- 88,101 ----
  <content type="string" default="" />
  </parameter>
  
+ <parameter name="discovery_enable" unique="0" required="0">
+ <longdesc lang="en">
+ Enable discovery? In some cases doing the discovery can break things on startup.
+ </longdesc>
+ <shortdesc lang="en">discovery_enable</shortdesc>
+ <content type="string" default="yes" />
+ </parameter>
+ 
  <parameter name="discovery_type" unique="0" required="0">
  <longdesc lang="en">
  Discovery type. Currently, with open-iscsi, only the sendtargets
***************
*** 179,187 ****
  #   3: iscsiadm returned error
  
  open_iscsi_discovery() {
! 	output=`$iscsiadm -m discovery -p $OCF_RESKEY_portal -t $discovery_type`
  	if [ $? -ne 0 -o x = "x$output" ]; then
! 		[ x != "x$output" ] && echo "$output"
  		return 3
  	fi
  	portal=`echo $output |
--- 188,205 ----
  #   3: iscsiadm returned error
  
  open_iscsi_discovery() {
! 	local output
! 	local portal
! 	local severity=err
! 	local cmd="$iscsiadm -m discovery -p $OCF_RESKEY_portal -t $discovery_type"
! 
! 	ocf_is_probe && severity=info
! 	output=`$cmd`
  	if [ $? -ne 0 -o x = "x$output" ]; then
! 		[ x != "x$output" ] && {
! 			ocf_log $severity "$cmd FAILED"
! 			echo "$output"
! 		}
  		return 3
  	fi
  	portal=`echo $output |
***************
*** 196,202 ****
  	case `echo $portal | wc -w` in
  	0) #target not found
  		echo "$output"
! 		ocf_log err "target $OCF_RESKEY_target not found at portal $OCF_RESKEY_portal"
  		return 1
  	;;
  	1) #we're ok
--- 214,220 ----
  	case `echo $portal | wc -w` in
  	0) #target not found
  		echo "$output"
! 		ocf_log $severity "target $OCF_RESKEY_target not found at portal $OCF_RESKEY_portal"
  		return 1
  	;;
  	1) #we're ok
***************
*** 336,343 ****
--- 354,366 ----
  	exit $OCF_ERR_PERM
  fi
  
+ discovery_enable=${OCF_RESKEY_discovery_enable:-yes}
+ [ $discovery_enable != yes ] && portal=$OCF_RESKEY_portal
  discovery_type=${OCF_RESKEY_discovery_type:-sendtargets}
  udev=${OCF_RESKEY_udev:-yes}
+ 
+ if [ $discovery_enable = yes ]
+ then
  $discovery  # discover and setup the real portal string (address)
  case $? in
  0) ;;
***************
*** 346,353 ****
     [ $1 = status ] && exit $LSB_STATUS_STOPPED
     exit $OCF_ERR_GENERIC
  ;;
! [23]) exit $OCF_ERR_GENERIC;;
  esac
  
  # which method was invoked?
  case $1 in
--- 369,380 ----
     [ $1 = status ] && exit $LSB_STATUS_STOPPED
     exit $OCF_ERR_GENERIC
  ;;
! 2) exit $OCF_ERR_GENERIC;;
! 3) ocf_is_probe && exit $OCF_NOT_RUNNING
!    exit $OCF_ERR_GENERIC
! ;;
  esac
+ fi
  
  # which method was invoked?
  case $1 in

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Here again with my problem with iscsi resource agent

2011-01-14 Thread RaSca

Il giorno Ven 14 Gen 2011 12:18:38 CET, Dejan Muhamedagic ha scritto:
[...]

iscsiadm fails with the following message:

Jan 13 19:15:46 debian-squeeze-nodo1 lrmd: [1274]: info: RA output: 
(www_db-iscsi:start:stderr) iscsiadm:
Jan 13 19:15:46 debian-squeeze-nodo1 lrmd: [1274]: info: RA output: 
(www_db-iscsi:start:stderr) no records found!

Try to figure out what does that mean, my iscsi is a bit rusty.


Hey Dej,
finally I made things work by rewriting the resource agent. I was 
unable to do a proper debug, so I cleaned up (for my needs, of course) your code.
You can find it attached to this mail. It works perfectly in my 
topology, but it has some limitations:


- It runs only on Linux (while your original was also meant for other 
situations);
- It runs only with udev (I don't do any udev check like in the 
original RA);
- It's obviously a rough first version.

Hope this can help improve things in some way. Let me know if I 
can do anything else.


Thanks a lot for your precious support!

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
#!/bin/sh
#
# iSCSI OCF resource agent
# Description: manage iSCSI disks (add/remove) using open-iscsi
#
# Developed by RaSca (ra...@miamammausalinux.org)
# Copyright (C) 2010 MMUL S.a.S.,  All Rights Reserved.
# Based upon the Resource Agent iscsi by Dejan Muhamedagic de...@suse.de
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#
# See usage() and meta_data() below for more details...
#
# OCF instance parameters:
#   OCF_RESKEY_portal: the iSCSI portal address or host name (required)
#   OCF_RESKEY_target: the iSCSI target (required)
#   OCF_RESKEY_iscsiadm: iscsiadm program path (optional)
#
# Initialization:

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

usage() {
  methods=`iscsi_methods`
  methods=`echo $methods | tr ' ' '|'`
  cat <<-!
usage: $0 {$methods}

$0 manages an iSCSI target

The 'start' operation starts (adds) the iSCSI target.
The 'stop' operation stops (removes) the iSCSI target.
The 'status' operation reports whether the iSCSI target is connected
The 'monitor' operation reports whether the iSCSI target is connected
The 'validate-all' operation reports whether the parameters are valid
The 'methods' operation reports on the methods $0 supports

!
}

meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="iscsi">
<version>1.0</version>

<longdesc lang="en">
OCF Resource Agent for iSCSI. Add (start) or remove (stop) iSCSI
targets.
</longdesc>
<shortdesc lang="en">Manages a local iSCSI initiator and its connections to
iSCSI targets</shortdesc>

<parameters>

<parameter name="portal" unique="0" required="1">
<longdesc lang="en">
The iSCSI portal address in the form: {ip_address|hostname}[:port]
</longdesc>
<shortdesc lang="en">portal</shortdesc>
<content type="string" default="" />
</parameter>

<parameter name="target" unique="1" required="1">
<longdesc lang="en">
The iSCSI target.
</longdesc>
<shortdesc lang="en">target</shortdesc>
<content type="string" default="" />
</parameter>

<parameter name="iscsiadm" unique="0" required="0">
<longdesc lang="en">
iscsiadm program path.
</longdesc>
<shortdesc lang="en">iscsiadm</shortdesc>
<content type="string" default="" />
</parameter>

</parameters>

<actions>
<action name="start" timeout="120" />
<action name="stop" timeout="120" />
<action name="status" timeout="30" />
<action name="monitor" depth="0" timeout="30" interval="120" />
<action name="validate-all" timeout="5" />
<action name="methods" timeout="5" />
<action name="meta-data" timeout="5" />
</actions>
</resource-agent>
END
}

open_iscsi_methods() {
  cat <<-!
start
stop
status
monitor
validate-all
methods
meta-data
usage
!
}

open_iscsi_daemon() {
	if ps -e -o cmd | grep -qs '[i]scsid'; then
		return 0
	else
		ocf_log err "iscsid not running; please start open-iscsi utilities"
		return 1
	fi

Re: [Linux-HA] Here again with my problem with iscsi resource agent

2011-01-14 Thread RaSca
Il giorno Ven 14 Gen 2011 16:07:42 CET, Dejan Muhamedagic ha scritto:
[...]
 Unfortunately, it cannot really improve situation in any way.
 Dropping half the agent and not knowing where is the problem is
 not helpful. At least not if you want to share your findings
 with the community.

I know that perfectly well, but my goal for now was to solve my problem. 
As I have repeated (too) many times these days, I was in trouble with a 
customer, and simplifying the script was the first step to understand how 
it was built.

 One of the possible explanations is that the target simply
 wasn't ready at the time it tried to start (as it has only be
 made available by the previous service). So, perhaps inserting a
 Delay resource in between could help here. If so, then maybe
 iSCSI* agents need to make sure that the service is really ready
 after start exits.

Consider that I have reproduced a virtual environment that can be used 
for doing all the tests needed.
I don't think that a Delay resource is the solution, but I agree with you 
that maybe reviewing the discovery code can help us.

 Looking at the diff, though it's simply impossible to figure out
 what changed because so many things did, it looks like you
 dropped the discovery code (iscsiadm -m discovery). Was that
 where you had problems? Well, we'll probably never know for
 sure.
 Unhappy,
 Dejan

I can say for sure that we will know. As you can see, in all of 
my posts (and projects) I never give up until there is a clear solution 
to the problem. I will find it out in this case too.

Thanks again,

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Here again with my problem with iscsi resource agent

2011-01-13 Thread RaSca

Il giorno Gio 13 Gen 2011 13:22:26 CET, Dejan Muhamedagic ha scritto:

Hi,


Hi Dej, and thanks as usual for your precious support.

[...]

iscsiadm fails in probe with the following messages:

Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr) iscsiadm:
Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr) cannot make connection to 10.0.0.100:3260 (113)
Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr)
Jan 13 12:15:19 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr) iscsiadm: connection to discovery address 
10.0.0.100 failed
Jan 13 12:15:22 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 
(113)#012iscsiadm: connection to discovery address 10.0.0.100 failed
Jan 13 12:15:26 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 
(113)#012iscsiadm: connection to discovery address 10.0.0.100 failed
Jan 13 12:15:29 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 
(113)#012iscsiadm: connection to discovery address 10.0.0.100 failed
Jan 13 12:15:33 debian-squeeze-nodo2 lrmd: [12063]: info: RA output: 
(www_db-iscsi:probe:stderr) iscsiadm: cannot make connection to 10.0.0.100:3260 
(113)#012iscsiadm: connection to discovery address 10.0.0.100 
failed#012iscsiadm: connection login retries (reopen_max) 5 
exceeded#012iscsiadm: Could not perform SendTargets discovery.

The ip (10.0.0.100) is not up yet at that point. Are you sure
that your configuration is sane. Looks somewhat strange to me.


My configuration looks fine to me (!). As I said, the problem comes up 
only at startup and not with normal failovers. What in particular looks 
strange to you?



Anyway, iscsi (the RA) should be more forgiving in probes. Can
you please try the attached patch.


The patch applies, but the resource fails to start up with:

Failed actions:
www_db-iscsi_monitor_0 (node=debian-squeeze-nodo1, call=39, rc=4, 
status=complete): insufficient privileges


The current log is attached; even if I try a cleanup, it ends again with 
the insufficient privileges message.



Hmm, couldn't find that in the logs. The first action on the
iscsi resource is probe and that fails.
Thanks,
Dejan


And this is the strange thing for me: why is iscsi probed if the db 
group is not yet active?


--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
Jan 13 14:37:19 debian-squeeze-nodo1 cibadmin: [1601]: info: Invoked: cibadmin 
-Ql -o resources 
Jan 13 14:37:19 debian-squeeze-nodo1 cibadmin: [1603]: info: Invoked: cibadmin 
-Ql -o resources 
Jan 13 14:37:19 debian-squeeze-nodo1 crm_resource: [1605]: info: Invoked: 
crm_resource -C -r www -H debian-squeeze-nodo1 
Jan 13 14:37:19 debian-squeeze-nodo1 crm_resource: [1605]: ERROR: 
unpack_rsc_op: Hard error - www_db-iscsi_monitor_0 failed with rc=4: Preventing 
www_db-iscsi from re-starting on debian-squeeze-nodo1
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_lrm_invoke: 
Removing resource www_db-iscsi from the LRM
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_lrm_invoke: 
Resource 'www_db-iscsi' deleted for 1605_crm_resource on debian-squeeze-nodo1
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: notify_deleted: 
Notifying 1605_crm_resource on debian-squeeze-nodo1 that www_db-iscsi was 
deleted
Jan 13 14:37:19 debian-squeeze-nodo1 cib: [30073]: info: cib_process_request: 
Operation complete: op cib_delete for section 
//node_state[@uname='debian-squeeze-nodo1']//lrm_resource[@id='www_db-iscsi'] 
(origin=local/crmd/175, version=0.303.9): ok (rc=0)
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: send_direct_ack: 
ACK'ing resource op www_db-iscsi_delete_6 from 0:0:crm-resource-1605: 
lrm_invoke-lrmd-1294925839-103
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: 
abort_transition_graph: te_update_diff:267 - Triggered transition abort 
(complete=1, tag=lrm_rsc_op, id=www_db-iscsi_monitor_0, 
magic=0:4;8:14:7:63667b9f-9576-4a57-b9b9-1f53d01ca8b7, cib=0.303.9) : Resource 
op removal
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_state_transition: 
State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_state_transition: 
All 1 cluster nodes are eligible to run resources.
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_pe_invoke: Query 
178: Requesting the current CIB: S_POLICY_ENGINE
Jan 13 14:37:19 debian-squeeze-nodo1 crmd: [30077]: info: do_lrm_invoke: 
Removing resource www_db-fs from the LRM
Jan 13 14:37:19 

Re: [Linux-HA] Here again with my problem with iscsi resource agent

2011-01-13 Thread RaSca
Il giorno Gio 13 Gen 2011 13:57:28 CET, Dejan Muhamedagic ha scritto:
 Hi,
[...]
 The patch applies, but the resource fails to startup with:
 Failed actions:
  www_db-iscsi_monitor_0 (node=debian-squeeze-nodo1, call=39,
 rc=4, status=complete): insufficient privileges
 +/usr/lib/ocf/resource.d//heartbeat/iscsi: Permission denied
 Try chmod +x /usr/lib/ocf/resource.d//heartbeat/iscsi :)

Shame on me. I'm an idiot :)

Now it seems to work. In the log I can see some messages like this:

Jan 13 15:56:09 debian-squeeze-nodo1 iscsid: connect to 10.0.0.100:3260 
failed (No route to host)

until the db resource comes up, then the iscsi resource comes up correctly.
But now there's another problem with the next resource in the chain: the 
first time the filesystem resource starts, it fails with this error:

Jan 13 16:13:20 debian-squeeze-nodo1 Filesystem[8207]: [8250]: INFO: 
Running start for 
/dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-part1
 
on /
db
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: 
(www_db-fs:start:stderr) FATAL: Module scsi_hostadapter not found.
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770010] sd 19:0:0:1: 
[sdc] Unhandled error code
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770056] sd 19:0:0:1: 
[sdc] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770123] sd 19:0:0:1: 
[sdc] CDB: Read(10): 28 00 00 00 00 3e 00 00 02 00
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.770361] end_request: 
I/O error, dev sdc, sector 62
Jan 13 16:13:20 debian-squeeze-nodo1 kernel: [19980.771595] EXT3-fs: 
unable to read superblock
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: 
(www_db-fs:start:stderr) mount: wrong fs type, bad option, bad 
superblock on /dev/sdc1,#012   missing codepage or helpe
r program, or other error
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: 
(www_db-fs:start:stderr)
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: 
(www_db-fs:start:stderr)In some cases useful info is found in 
syslog - try#012   dmesg | tail  or so
Jan 13 16:13:20 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: 
(www_db-fs:start:stderr)
Jan 13 16:13:20 debian-squeeze-nodo1 Filesystem[8207]: [8266]: ERROR: 
Couldn't mount filesystem 
/dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-p
art1 on /db
Jan 13 16:13:20 debian-squeeze-nodo1 crmd: [7400]: info: 
process_lrm_event: LRM operation www_db-fs_start_0 (call=32, rc=1, 
cib-update=57, confirmed=true) unknown error

But, even if the system says "FATAL: Module scsi_hostadapter not 
found.", if I do a cleanup of the resource it comes up without any other 
problem:

Jan 13 16:20:26 debian-squeeze-nodo1 Filesystem[11389]: [11444]: INFO: 
Running start for 
/dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-part1
 
on /db
Jan 13 16:20:26 debian-squeeze-nodo1 lrmd: [7397]: info: RA output: 
(www_db-fs:start:stderr) FATAL: Module scsi_hostadapter not found.
Jan 13 16:20:27 debian-squeeze-nodo1 kernel: [20407.470761] kjournald 
starting.  Commit interval 5 seconds
Jan 13 16:20:27 debian-squeeze-nodo1 kernel: [20407.478976] EXT3 FS on 
sdc1, internal journal
Jan 13 16:20:27 debian-squeeze-nodo1 kernel: [20407.479074] EXT3-fs: 
mounted filesystem with ordered data mode.

So everything is ok. The filesystem resource is declared in this way:

primitive www_db-fs ocf:heartbeat:Filesystem \
        params device="/dev/disk/by-path/ip-10.0.0.100:3260-iscsi-iqn.2010-12.local.rascanet:db.rascanet.iscsi-lun-1-part1" \
        directory="/db" fstype="ext3" \
        op monitor interval="20s" timeout="40s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"

What could the problem be?

[...]
 All resources are probed at startup regardless of dependencies.
 It's up to the resource agents to manage such situations.
 Thanks,
 Dejan

Ok, now it's clear.

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Problem with dependent groups at cluster startup

2011-01-11 Thread RaSca
Hi all,
I've got two groups of resources, say A and B. A depends on B, so if B 
isn't up, A must NOT be started. This is the situation:
group B Bres1 Bres2 Bres3

colocation B_ON_B_ms-r1 inf: B B_ms-r1:Master
order B_AFTER_B_ms-r1 inf: B_ms-r1:promote B:start

group A  Ares1 Ares2 Ares3

colocation A_ON_A_ms-r0 inf: A A_ms-r0:Master
order A_AFTER_A_ms-r0 inf: A_ms-r0:promote A:start

order A_AFTER_B inf: B:start A:start

In a running configuration everything works fine: all the services 
switch in case of failures or if I force a manual move.
The problem is at startup, because for some reason Pacemaker (1.0.10) 
starts Ares1 first of all, which of course fails because B is not started.

I think that the last order constraint should force A not to start 
before B is up, but things are not going that way.

What am I missing?

Thanks a lot!

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problem with dependent groups at cluster startup

2011-01-11 Thread RaSca
Il giorno Mar 11 Gen 2011 13:36:58 CET, RaSca ha scritto:
 Hi all,
 I've got two group of resources, say A and B. A depends on B, so if B
 isn't up A must NOT be started. This is the situation:
 group B Bres1 Bres2 Bres3
 colocation B_ON_B_ms-r1 inf: B B_ms-r1:Master
 order B_AFTER_B_ms-r1 inf: B_ms-r1:promote B:start
 group A  Ares1 Ares2 Ares3
 colocation A_ON_A_ms-r0 inf: A A_ms-r0:Master
 order A_AFTER_A_ms-r0 inf: A_ms-r0:promote A:start
 order A_AFTER_B inf: B:start A:start
 In a running configuration everything works fine, all the services
 switch in case of failures or if I force a manual move.
 The problem is on startup, because for some reason Pacemaker (1.0.10)
 start first of all Ares1, that of course fails because B is not started.
 I think that the last order constraint must obligate A to not start
 before B is done, but things are not going in this way.
 What am I missing?
 Thanks a lot!

To be more specific, this is the log of what happens when I start up 
just one node: http://pastebin.com/YsP4B94r and this is my 
configuration: http://pastebin.com/77gwm1Gr

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ha for 2 jboss instancies in an active/passive cluster

2011-01-07 Thread RaSca
Il giorno Mar 04 Gen 2011 12:13:48 CET, Erik Dobák ha scritto:
[...]
 i have found this http://www.linux-ha.org/doc/re-ra-jboss.html but am simply
 not able to understand where this:
 primitive example_jboss ocf:heartbeat:jboss \
params \
  jboss_home=*string* \
op monitor depth=0 timeout=30s interval=10s
 belongs.
 i would be thankfull for some better jboss example or a few hints.
 E

First of all: are you using Pacemaker or not? If you're using Heartbeat 
without the CRM, then you might consider upgrading to Pacemaker, or 
creating two different init scripts, one for each jboss instance, and 
then configuring them in /etc/ha.d/haresources.
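In the haresources case, roughly something like this (node names, IP 
addresses and init script names are only placeholders):

node1 IPaddr::192.168.0.10/24 jboss-instance1
node2 IPaddr::192.168.0.11/24 jboss-instance2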

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Dependent groups of resources that can reside on different nodes

2010-12-22 Thread RaSca
Il giorno Lun 20 Dic 2010 13:20:42 CET, RaSca ha scritto:
[...]
 What am I missing?

As discussed with Andrew on IRC, the problem is fixed by removing the 
parentheses from the order and colocation declarations.
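In other words, something like this (resource names are taken from my 
earlier mail, shown here only as a sketch):

# what I had (each resource wrapped in its own set):
order A_after_B inf: ( B_ip:start ) ( A_iscsi-db:start )
# what works (a plain ordered chain, no parentheses):
order A_after_B inf: B_ip:start A_iscsi-db:start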

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Dependent groups of resources that can reside on different nodes

2010-12-20 Thread RaSca
Hi all,
I've got two groups of resources, say A and B. Suppose this is the situation:

group A:

colocation B_ON_B_ms-r1 inf: ( B_ip ) ( B_iscsitarget-lun ) ( 
B_iscsitarget-export ) B_ms-r1:Master
order B_AFTER_B_ms-r1 inf: B_ms-r1:promote ( B_iscsitarget-export:start 
) ( B_iscsitarget-lun:start ) ( B_ip:start )

group B:

colocation A_ON_A_ms-r0 inf: ( A_ip ) ( A_fs ) ( A_db-fs ) ( A_iscsi-db 
) A_ms-r0:Master
order A_AFTER_A_ms-r0 inf: A_ms-r0:promote ( A_iscsi-db:start ) ( 
A_db-fs:start ) ( A_fs:start ) ( A_ip:start )

The point is that the two groups can reside on different nodes, but A 
depends on B, so I want that if B switches, A is stopped first, then B 
is started, and after that A is restarted.
I've declared this order constraint, but it doesn't do what I'm expecting:

order A_after_B inf: ( B_iscsitarget-export:start ) ( 
B_iscsitarget-lun:start ) ( B_ip:start ) ( A_iscsi-db:start ) ( 
A_db-fs:start ) ( A_fs:start ) ( A_ip:start )

What am I missing?

Thanks a lot!

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] custom jboss init script on pacemaker

2010-11-30 Thread RaSca
Il giorno Mar 30 Nov 2010 11:38:39 CET, Michael Kromer ha scritto:
 Hi,
 I've never seen a real LSB-conform init script of jboss, but the one
 getting real close I know is
 http://www.riccardoriva.com/shared-files/jboss_init_script
 You might need to strip out the oracle stuff defined in there, but it
 should be a good starting point, as it handles status) the way LSB asks
 for it.
 - mike

I can confirm that the RA shipped with pacemaker 1.0.9 and 1.0.10 works 
great for me too.

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] custom jboss init script on pacemaker

2010-11-30 Thread RaSca
Il giorno Mar 30 Nov 2010 11:55:50 CET, Michael Kromer ha scritto:
 right, for reference:
 http://www.linux-ha.org/doc/re-ra-jboss.html
 I just recommend to take a safe look at the timeouts, as 60s could be
 too short for some larger applications.
 - mike

I confirm. I had to set 240s in some cases; jboss is sometimes very slow.
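For reference, roughly something like this (paths and names are only 
examples, adapt them to your installation):

primitive p_jboss ocf:heartbeat:jboss \
        params jboss_home="/opt/jboss" java_home="/usr/lib/jvm/java-6-openjdk" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="240s" \
        op monitor interval="10s" timeout="30s"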

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Anything resource agent and workdir

2010-10-29 Thread RaSca

Hi Guys,
working with some Java batch programs I needed to configure the anything 
resource agent. I found that there's no way to define the working 
directory from which the executable must be launched.

So I solved my problem by patching the resource agent to support a 
workdir parameter.


I don't know if this is the best solution (any suggestion will be 
appreciated), but for me it worked, so I share it (you can find it 
attached).


Bye,

--
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
mobile: +393281776712
ra...@miamammausalinux.org
http://www.miamammausalinux.org
--- ../heartbeat/anything	2010-07-15 11:26:18.0 +0200
+++ anything	2010-10-29 11:04:44.0 +0200
@@ -27,6 +27,7 @@
 # OCF instance parameters
 #   OCF_RESKEY_binfile
 #   OCF_RESKEY_cmdline_options
+#	OCF_RESKEY_workdir
 #	OCF_RESKEY_pidfile
 #   OCF_RESKEY_logfile
 #   OCF_RESKEY_errlogfile
@@ -34,7 +35,7 @@
 #   OCF_RESKEY_monitor_hook
 #   OCF_RESKEY_stop_timeout
 #
-# This RA starts $binfile with $cmdline_options as $user and writes a $pidfile from that. 
+# This RA starts $binfile with $cmdline_options as $user in $workdir and writes a $pidfile from that. 
 # If you want it to, it logs:
 # - stdout to $logfile, stderr to $errlogfile or 
 # - stdout and stderr to $logfile
@@ -74,14 +75,14 @@
 		if [ -n "$logfile" -a -n "$errlogfile" ]
 		then
 			# We have logfile and errlogfile, so redirect STDOUT und STDERR to different files
-			cmd="su - $user -c \"nohup $binfile $cmdline_options >> $logfile 2>> $errlogfile & \"'echo \$!' "
+			cmd="su - $user -c \"cd $workdir; nohup $binfile $cmdline_options >> $logfile 2>> $errlogfile & \"'echo \$!' "
 		else if [ -n "$logfile" ]
 			then
 				# We only have logfile so redirect STDOUT and STDERR to the same file
-				cmd="su - $user -c \"nohup $binfile $cmdline_options >> $logfile 2>&1 & \"'echo \$!' "
+				cmd="su - $user -c \"cd $workdir; nohup $binfile $cmdline_options >> $logfile 2>&1 & \"'echo \$!' "
 			else
 				# We have neither logfile nor errlogfile, so we're not going to redirect anything
-				cmd="su - $user -c \"nohup $binfile $cmdline_options & \"'echo \$!'"
+				cmd="su - $user -c \"cd $workdir; nohup $binfile $cmdline_options & \"'echo \$!'"
 			fi
 		fi
 		ocf_log debug "Starting $process: $cmd"
@@ -169,6 +170,7 @@
 process=$OCF_RESOURCE_INSTANCE
 binfile=$OCF_RESKEY_binfile
 cmdline_options=$OCF_RESKEY_cmdline_options
+workdir=$OCF_RESKEY_workdir
 pidfile=$OCF_RESKEY_pidfile
 [ -z "$pidfile" ] && pidfile=${HA_VARRUN}/anything_${process}.pid
 logfile=$OCF_RESKEY_logfile
@@ -225,6 +227,13 @@
 <shortdesc lang="en">Command line options</shortdesc>
 <content type="string" />
 </parameter>
+<parameter name="workdir" required="1" unique="1">
+<longdesc lang="en">
+The path from where the binfile will be executed.
+</longdesc>
+<shortdesc lang="en">Full path name of the executable directory</shortdesc>
+<content type="string" default=""/>
+</parameter>
 <parameter name="pidfile">
 <longdesc lang="en">
 File to read/write the PID from/to.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] HELP - debugging a hanging domU boot?

2010-10-28 Thread RaSca
Il giorno Gio 28 Ott 2010 14:45:32 CET, Miles Fidelman ha scritto:
[...]
 Any suggestions?

I ran into a similar problem that was caused by using the wrong console:

http://groups.google.com/group/ganeti/browse_thread/thread/639f297c738e5adb

I was using ganeti, but it can't be much different.

Bye,

-- 
Raoul Scarazzini
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
mobile: +393281776712
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Handling colocation constraints with more than 2 entries

2010-10-06 Thread RaSca
Il giorno Mer 06 Ott 2010 12:34:22 CET, Dejan Muhamedagic ha scritto:
[...]
 There's also role change which forces a break in the set. There
 was posted an example like this to the list:
 (a) collocation c1 inf: ms-r0:Master fs jboss
 or, expressed as a chain of two-resource collocations:
 collocation c1_1 inf: jboss fs
 collocation c1_2 inf: fs ms-r0:Master
 Now, the resource set would have to be split in two because of
 the role change, but that would also force us to move stuff
 around to preserve semantics:
 (b) collocation c1 inf: [ fs jboss ] [ ms-r0:Master ]
 Because adjacent resource sets have the same semantics as
 two-resource collocations, right?
 The case when the role change is in the middle:
 (a) collocation c1 inf: A B:Master C
 becomes
 (b) collocation c1 inf: [ C ] [ B:Master ] [ A ]
 All this strikes me as suboptimal, but I'm not sure what is to be
 done about it.
 Basically, users should be able to type the (a) versions and let
 the software handle the rest.
 Opinions?
 Cheers,
 Dejan

I totally agree with you, Dej. Note that, to make things work and keep 
them compatible, an automated management of the sets like this one is 
IMHO the best way.

Glad to see that my problems created such a discussion!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Trying to understand sets

2010-10-05 Thread RaSca
Il giorno Mar 05 Ott 2010 15:55:56 CET, Dejan Muhamedagic ha scritto:
[...]
 The problem seems to be in particular with collocations. Please
 see the other thread in which Andreas Kurz explained the
 differences well.
 Thanks,

Thanks to you Dejan! I'm following with all the attentions that thread.

Bye,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Trying to understand sets

2010-10-01 Thread RaSca
Hi all,
as discussed two days ago on IRC, since the 1.0.9 version has some 
problems with multistate resources and groups, I'm trying to make sets work.
I started from this configuration:

group cluster cluster-fs cluster-jboss
ms cluster-ms-r0 cluster-r0 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true
colocation cluster_on_cluster-r0 inf: cluster cluster-ms-r0:Master
order cluster_after_cluster-r0 inf: cluster-ms-r0:promote cluster-fs:start

But, as I said, there's a bug in this version of Pacemaker, so the 
solution is to treat every resource individually. Thanks to the IRC guys 
match and andreask I found a solution using sets, in this way:

ms cluster-ms-r0 cluster-r0 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true
colocation cluster-fs_on_cluster-r0 inf: ( cluster-jboss ) ( cluster-fs 
) cluster-ms-r0:Master
order cluster-fs_after_cluster-r0 inf: cluster-ms-r0:promote ( 
cluster-fs:start ) ( cluster-jboss:start )

Everything works fine. But I have two questions:

1) What's the difference between the declaration with sets above and 
this one:

ms cluster-ms-r0 cluster-r0 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true
colocation cluster-fs_on_cluster-r0 inf: cluster-jboss cluster-fs 
cluster-ms-r0:Master
order cluster-fs_after_cluster-r0 inf: cluster-ms-r0:promote 
cluster-fs:start cluster-jboss:start

Do the resources come up all at the same time in this way? Note: this 
did not work for me, of course.

2) Using sets, is it also possible to declare groups just for logical 
purposes (and for the output of crm_mon), without using them in 
colocation and order declarations? Does the creation of a group change 
the relations between the resources?

Thanks a lot!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat/Pacemaker Italian article series complete!

2010-09-22 Thread RaSca
Il giorno Mer 22 Set 2010 13:44:07 CET, Dejan Muhamedagic ha scritto:
[...]
 Didn't understand a thing, but looks great :)

LOL!

 Just one note, it caught my attention, a node preference is
 usually expressed in non-absolute terms and using a shorter
 syntax:
 location cli-prefer-share-a share-a 100: ubuntu-nodo1
 Cheers,
 Dejan

I just used that declaration (location cli-prefer-share-a share-a rule 
inf: #uname eq ubuntu-nodo1) to reflect the rule that Pacemaker adds 
when you migrate a resource, because I mention it in the second test.
Anyway, I will keep this short declaration in mind, it will surely be 
useful.

Thanks a lot Dejan!

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Heartbeat/Pacemaker Italian article series complete!

2010-09-21 Thread RaSca
Hi all guys,
Yesterday I've finally finished and published the last article of the 
Heartbeat/Pacemaker series. These are the links to the articles:

http://www.miamammausalinux.org/2010/04/evoluzione-dellalta-affidabilita-su-linux-come-orientarsi-fra-hertbeat-pacemaker-openais-e-corosync/

http://www.miamammausalinux.org/2010/06/evoluzione-dellalta-affidabilita-su-linux-confronto-pratico-tra-heartbeat-classico-ed-heartbeat-con-pacemaker/

http://www.miamammausalinux.org/2010/09/evoluzione-dellalta-affidabilita-su-linux-realizzare-un-nas-con-pacemaker-drbd-ed-exportfs/

All three articles are written in Italian, but I hope you will enjoy 
them anyway.

Keep up the good work! And thanks again for the help you give me anytime.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA configuration issues

2010-08-04 Thread RaSca
Il giorno Mar 03 Ago 2010 19:36:56 CET, Tim Macking ha scritto:
 I have a system that is in production and has issues.  While I appreciate
 the link of books to read, I really came here looking for some help or
 advice.  If anyone could offer some, I would be most grateful and
 appreciative.
 Telling me to go read is like telling someone who wants to know what the
 Golden Rule is to go read the Bible.
 Please, can anyone offer some advice or help her?
[...]

I suggested that reading because what you asked and what is described in 
Clusters From Scratch are very similar. Andrew's document contains much 
of the information you need (including why RedHat and Fedora prefer 
corosync instead of heartbeat).

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA configuration issues

2010-08-02 Thread RaSca
Il giorno Lun 02 Ago 2010 15:25:36 CET, Tim Macking ha scritto:
 I am fairly new to Linux, specifically RedHat Enterprise.
 The project I have now is unraveling how 2 servers were setup with HA, why
 it is working (but not entirely), and how to get it configured correctly.
 I have read over the documentation at http://www.linux-ha.org/doc/
[...]

I strongly recommend you to give a look at Clusters From Scratch, here: 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/

It will be helpful to understand how the cluster can be configured from 
start.

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] NFSServer options question

2010-07-07 Thread RaSca
Il giorno Ven 02 Lug 2010 18:50:29 CET, Daniel Machado Grilo ha scritto:
 Dear HA users,
[...]
 I understand I have to add an nfsserver primitive for each group of
 services, as a group can migrate to other node and then, the heartbeat
 will not be able to unmount the FS because NFS is using it.
[...]

Hi Daniel,
note that you can't have two nfsserver primitives declared in different 
groups in the same cluster: the nfsserver primitive refers to the 
system-wide /etc/exports, which must be the same on the two nodes.
If you want an active/active NFS setup you should use the exportfs resource.
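As a minimal sketch (directory, network and fsid are only example values):

primitive p_exportfs_share1 ocf:heartbeat:exportfs \
        params directory="/srv/share1" clientspec="10.0.0.0/24" \
        options="rw,sync" fsid="1" \
        op monitor interval="30s"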

Have a good day,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

