[ClusterLabs] Antw: Re: Antw: Re: snapshots in a clvm environment - some questions for proceeding

2017-03-01 Thread Ulrich Windl
>>> "Lentes, Bernd" wrote on 01.03.2017 at 23:13 in message
<5769d607-e3f8-4c7d-bd70-f72e3a994...@helmholtz-muenchen.de>:

>> 
>> Actually that's what we are doing, but at some point in the past I had ruined
>> the directory with the VM images (user error). The problem then was that you
>> need a running VM to restore files inside the VM. This is when you would like
>> to have a crash-consistent backup image of your VM. However, I have found no
>> working solution yet (we have VM images hosted on OCFS2, hosted on a cLVM LV).
>> 
>> Regards,
>> Ulrich
>> 
>>> 
> 
> With OCFS2 you could snapshot the image file (I think they call it reflink).

I didn't find the proper tools to do so in SLES11, and the manual page is quite
vague on using the REFLINK feature. How do you do it?

> 
> Bernd 
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de 
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons 
> Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 







[ClusterLabs] Antw: Cannot clone clvmd resource

2017-03-01 Thread Ulrich Windl
Hi!

What about colocation and ordering?

Regards,
Ulrich

>>> Anne Nicolas wrote on 01.03.2017 at 22:49 in message
<0b585272-1c5b-0f07-1f01-747c003c6...@gmail.com>:
> Hi there
> 
> 
> I'm testing a fairly simple configuration to work with clvm. I'm
> going crazy because it seems clvmd cannot be cloned on other nodes.
> 
> clvmd starts fine on node1 but fails on both node2 and node3.
> 
> In pacemaker journalctl I get the following message
> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd:
> No such file or directory
> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat
> /cmirrord: No such file or directory
> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd
> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms
> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok
> (node=node3, call=233, rc=0, cib-update=541, confirmed=true)
> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop
> p-dlm_stop_0 on node3 (local)
> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm
> action:stop call_id:235
> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop
> p-dlm_stop_0 on node2
> 
> Here is my configuration
> 
> node 739312139: node1
> node 739312140: node2
> node 739312141: node3
> primitive admin_addr IPaddr2 \
> params ip=172.17.2.10 \
> op monitor interval=10 timeout=20 \
> meta target-role=Started
> primitive p-clvmd ocf:lvm2:clvmd \
> op start timeout=90 interval=0 \
> op stop timeout=100 interval=0 \
> op monitor interval=30 timeout=90
> primitive p-dlm ocf:pacemaker:controld \
> op start timeout=90 interval=0 \
> op stop timeout=100 interval=0 \
> op monitor interval=60 timeout=90
> primitive stonith-sbd stonith:external/sbd
> group g-clvm p-dlm p-clvmd
> clone c-clvm g-clvm meta interleave=true
> property cib-bootstrap-options: \
> have-watchdog=true \
> dc-version=1.1.13-14.7-6f22ad7 \
> cluster-infrastructure=corosync \
> cluster-name=hacluster \
> stonith-enabled=true \
> placement-strategy=balanced \
> no-quorum-policy=freeze \
> last-lrm-refresh=1488404073
> rsc_defaults rsc-options: \
> resource-stickiness=1 \
> migration-threshold=10
> op_defaults op-options: \
> timeout=600 \
> record-pending=true
> 
> Thanks in advance for your input
> 
> Cheers
> 
> -- 
> Anne Nicolas
> http://mageia.org 
> 







[ClusterLabs] Antw: Expected recovery behavior of remote-node guest when corosync ring0 is lost in a passive mode RRP config?

2017-03-01 Thread Ulrich Windl
>>> "Scott Greenlese" wrote on 01.03.2017 at 22:07 in message:

> Hi..
> 
> I am running a few corosync "passive mode" Redundant Ring Protocol (RRP)
> failure scenarios, where my cluster has several remote-node VirtualDomain
> resources running on each node in the cluster, which have been configured
> to allow Live Guest Migration (LGM) operations.
> 
> While both corosync rings are active, if I drop ring0 on a given node where
> I have remote nodes (guests) running, I noticed that the guest is shut down
> and restarted on the same host, after which the connection is re-established
> and the guest proceeds to run on that same cluster node.

Could it be you forgot "allow-migrate=true" at the resource level or some 
migration IP address at the node level?
I only have SLES11 here...

> 
> I am wondering why pacemaker doesn't try to "live" migrate the remote node
> (guest) to a different node instead of rebooting the guest. Is there some way
> to configure the remote nodes such that the recovery action is LGM instead of
> reboot when the host-to-remote-node connection is lost in an RRP situation?
> I guess the next question is: is it even possible to LGM a remote node guest
> if the corosync ring fails over from ring0 to ring1 (or vice versa)?
> 
> # For example, here's a remote node's VirtualDomain resource definition.
> 
> [root@zs95kj]# pcs resource show  zs95kjg110102_res
>  Resource: zs95kjg110102_res (class=ocf provider=heartbeat
> type=VirtualDomain)
>   Attributes: config=/guestxml/nfs1/zs95kjg110102.xml
> hypervisor=qemu:///system migration_transport=ssh
>   Meta Attrs: allow-migrate=true remote-node=zs95kjg110102
> remote-addr=10.20.110.102
>   Operations: start interval=0s timeout=480
> (zs95kjg110102_res-start-interval-0s)
>   stop interval=0s timeout=120
> (zs95kjg110102_res-stop-interval-0s)
>   monitor interval=30s (zs95kjg110102_res-monitor-interval-30s)
>   migrate-from interval=0s timeout=1200
> (zs95kjg110102_res-migrate-from-interval-0s)
>   migrate-to interval=0s timeout=1200
> (zs95kjg110102_res-migrate-to-interval-0s)
> [root@zs95kj VD]#
> 
> 
> 
> 
> # My RRP rings are active, and configured with "rrp_mode: passive"
> 
> [root@zs95kj ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 2
> RING ID 0
> id  = 10.20.93.12
> status  = ring 0 active with no faults
> RING ID 1
> id  = 10.20.94.212
> status  = ring 1 active with no faults
> 
> 
> 
> # Here's the corosync.conf ..
> 
> [root@zs95kj ~]# cat /etc/corosync/corosync.conf
> totem {
> version: 2
> secauth: off
> cluster_name: test_cluster_2
> transport: udpu
> rrp_mode: passive
> }
> 
> nodelist {
> node {
> ring0_addr: zs95kjpcs1
> ring1_addr: zs95kjpcs2
> nodeid: 2
> }
> 
> node {
> ring0_addr: zs95KLpcs1
> ring1_addr: zs95KLpcs2
> nodeid: 3
> }
> 
> node {
> ring0_addr: zs90kppcs1
> ring1_addr: zs90kppcs2
> nodeid: 4
> }
> 
> node {
> ring0_addr: zs93KLpcs1
> ring1_addr: zs93KLpcs2
> nodeid: 5
> }
> 
> node {
> ring0_addr: zs93kjpcs1
> ring1_addr: zs93kjpcs2
> nodeid: 1
> }
> }
> 
> quorum {
> provider: corosync_votequorum
> }
> 
> logging {
> to_logfile: yes
> logfile: /var/log/corosync/corosync.log
> timestamp: on
> syslog_facility: daemon
> to_syslog: yes
> debug: on
> 
> logger_subsys {
> debug: off
> subsys: QUORUM
> }
> }
> 
> 
> 
> 
> # Here's the vlan / route situation on cluster node zs95kj:
> 
> ring0 is on vlan1293
> ring1 is on vlan1294
> 
> [root@zs95kj ~]# route -n
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags Metric Ref Use Iface
> 0.0.0.0         10.20.93.254    0.0.0.0         UG    400    0   0   vlan1293  << default route to guests from ring0
> 9.0.0.0         9.12.23.1       255.0.0.0       UG    400    0   0   vlan508
> 9.12.23.0       0.0.0.0         255.255.255.0   U     400    0   0   vlan508
> 10.20.92.0      0.0.0.0         255.255.255.0   U     400    0   0   vlan1292
> 10.20.93.0      0.0.0.0         255.255.255.0   U     0      0   0   vlan1293  << ring0 IPs
> 10.20.93.0      0.0.0.0         255.255.255.0   U     400    0   0   vlan1293
> 10.20.94.0      0.0.0.0         255.255.255.0   U     0      0   0   vlan1294  << ring1 IPs
> 10.20.94.0      0.0.0.0         255.255.255.0   U     400    0   0   vlan1294
> 10.20.101.0     0.0.0.0         255.255.255.0   U     400    0   0   vlan1298
> 10.20.109.0     10.20.94.254    255.255.255.0   UG    400    0   0   vlan1294  << Route to guests on 10.20.109 from ring1
> 10.20.110.0     10.20.94.254    255.255.255.0   UG    400

Re: [ClusterLabs] PCMK_OCF_DEGRADED (_MASTER): exit codes are mapped to PCMK_OCF_UNKNOWN_ERROR

2017-03-01 Thread Andrew Beekhof
On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg
 wrote:
> When I recently tried to make use of the DEGRADED monitoring results,
> I found out that it still does not work.
>
> That is because LRMD chooses to filter them in ocf2uniform_rc()
> and maps them to PCMK_OCF_UNKNOWN_ERROR.
>
> See patch suggestion below.
>
> It also filters away the other "special" rc values.
> Do we really not want to see them in crmd/pengine?

I would think we do.

> Why does LRMD think it needs to outsmart the pengine?

Because the person that implemented the feature incorrectly assumed
the rc would be passed back unmolested.
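
To make the filtering concrete, here is a minimal Python sketch of the behaviour before and after the proposed patch. It is only an illustration, not the lrmd code: the numeric values are my assumption of what Pacemaker's ocf_exitcode enum contains (0-9 for the standard OCF codes, 190 and 191 for the two DEGRADED codes) and should be checked against the headers of the version in use.

```python
# Illustration only: assumed numeric values of the OCF exit codes.
PCMK_OCF_OK = 0
PCMK_OCF_UNKNOWN_ERROR = 1
PCMK_OCF_FAILED_MASTER = 9
PCMK_OCF_DEGRADED = 190          # assumed value
PCMK_OCF_DEGRADED_MASTER = 191   # assumed value


def old_ocf2uniform_rc(rc):
    """Range check as in the removed lines: everything outside 0..9 is lost."""
    if rc < 0 or rc > PCMK_OCF_FAILED_MASTER:
        return PCMK_OCF_UNKNOWN_ERROR
    return rc


# Codes the patched function hands back unchanged (the "case" labels).
PASS_THROUGH = set(range(PCMK_OCF_OK, PCMK_OCF_FAILED_MASTER + 1)) | {
    PCMK_OCF_DEGRADED,
    PCMK_OCF_DEGRADED_MASTER,
}


def new_ocf2uniform_rc(rc):
    """Whitelist as in the added switch: known codes pass, the rest map to 1."""
    return rc if rc in PASS_THROUGH else PCMK_OCF_UNKNOWN_ERROR


print(old_ocf2uniform_rc(PCMK_OCF_DEGRADED))  # the degraded code is filtered
print(new_ocf2uniform_rc(PCMK_OCF_DEGRADED))  # the degraded code survives
```

With the old range check a degraded monitor result never reaches crmd/pengine, which matches the symptom described above.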

>
> Note: I did build it, but did not use this yet,
> so I have no idea if the rest of the implementation of the DEGRADED
> stuff works as intended or if there are other things missing as well.

failcount might be the other place that needs some massaging:
specifically, it should not be incremented when a degraded rc comes through.

>
> Thoughts?

looks good to me

>
> diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c
> index 724edb7..39a7dd1 100644
> --- a/lrmd/lrmd.c
> +++ b/lrmd/lrmd.c
> @@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const char *stdout_data)
>  static int
>  ocf2uniform_rc(int rc)
>  {
> -    if (rc < 0 || rc > PCMK_OCF_FAILED_MASTER) {
> -        return PCMK_OCF_UNKNOWN_ERROR;
> +    switch (rc) {
> +    default:
> +        return PCMK_OCF_UNKNOWN_ERROR;
> +
> +    case PCMK_OCF_OK:
> +    case PCMK_OCF_UNKNOWN_ERROR:
> +    case PCMK_OCF_INVALID_PARAM:
> +    case PCMK_OCF_UNIMPLEMENT_FEATURE:
> +    case PCMK_OCF_INSUFFICIENT_PRIV:
> +    case PCMK_OCF_NOT_INSTALLED:
> +    case PCMK_OCF_NOT_CONFIGURED:
> +    case PCMK_OCF_NOT_RUNNING:
> +    case PCMK_OCF_RUNNING_MASTER:
> +    case PCMK_OCF_FAILED_MASTER:
> +
> +    case PCMK_OCF_DEGRADED:
> +    case PCMK_OCF_DEGRADED_MASTER:
> +        return rc;
> +
> +#if 0
> +        /* What about these?? */

yes, these should get passed back as-is too

> +        /* 150-199 reserved for application use */
> +        PCMK_OCF_CONNECTION_DIED = 189, /* Operation failure implied by disconnection of the LRM API to a local or remote node */
> +
> +        PCMK_OCF_EXEC_ERROR    = 192, /* Generic problem invoking the agent */
> +        PCMK_OCF_UNKNOWN       = 193, /* State of the service is unknown - used for recording in-flight operations */
> +        PCMK_OCF_SIGNAL        = 194,
> +        PCMK_OCF_NOT_SUPPORTED = 195,
> +        PCMK_OCF_PENDING       = 196,
> +        PCMK_OCF_CANCELLED     = 197,
> +        PCMK_OCF_TIMEOUT       = 198,
> +        PCMK_OCF_OTHER_ERROR   = 199, /* Keep the same codes as PCMK_LSB */
> +#endif
>      }
> -
> -    return rc;
>  }
>
>  static int
>



Re: [ClusterLabs] R: Re: Antw: Re: Ordering Sets of Resources

2017-03-01 Thread Ken Gaillot
On 03/01/2017 03:22 PM, iva...@libero.it wrote:
> You are right, but I had to use the option symmetrical=false because I need to
> be able to stop even a single primitive, while all resources are running, with
> no impact on the other resources.
> 
> I have also used symmetrical=false with kind=Optional.
> Stopping an individual resource then does not stop the other resources, but if
> a list of primitives is given in no particular order during startup or
> shutdown, the resources start or stop without strictly respecting the
> constraint.
> 
> Regards
> Ivan

If I understand, you want to be able to specify resources A B C such
that they always start in that order, but stopping can be in any
combination:
* just A
* just B
* just C
* just A and B (in which case B stops then A)
* just A and C (in which case C stops then A)
* just B and C (in which case C stops then B)
* or all (in which case C stops, then B, then A)

There may be a fancy way to do it with sets, but my first thought is:

* Keep the start constraint you have

* Use individual ordering constraints between each resource pair with
kind=Optional and action=stop
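
As a rough sketch of the second bullet, the pairwise stop constraints can be generated from the start order. The resource names A, B and C are placeholders, and the command form (pcs constraint order stop X then stop Y kind=Optional) is my assumption of how this would be written with pcs; please check it against your pcs version before using it.

```python
from itertools import combinations


def stop_order_commands(start_order):
    """Build one optional stop-ordering command per resource pair.

    start_order lists the resources in the order they start; for every
    pair, the resource that starts later is asked to stop first.
    """
    commands = []
    for earlier, later in combinations(start_order, 2):
        commands.append(
            f"pcs constraint order stop {later} then stop {earlier} kind=Optional"
        )
    return commands


# Placeholder resource names, matching the A/B/C example above.
for command in stop_order_commands(["A", "B", "C"]):
    print(command)
```

Because each constraint is optional, it only takes effect when both resources of the pair are stopping in the same transition, which is what allows any single resource to be stopped on its own.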

>> Original message
>> From: "Ken Gaillot" 
>> Date: 01/03/2017 15.57
>> To: "Ulrich Windl", 
>> Subject: Re: [ClusterLabs] Antw: Re:  Ordering Sets of Resources
>>
>> On 03/01/2017 01:36 AM, Ulrich Windl wrote:
>>>>>> Ken Gaillot wrote on 26.02.2017 at 20:04 in message:
>>>> On 02/25/2017 03:35 PM, iva...@libero.it wrote:
>>>>> Hi all,
>>>>> I have configured a two-node cluster on redhat 7.
>>>>>
>>>>> Because I need to stop and start resources individually while
>>>>> they are running, I have configured the cluster using order set constraints.
>>>>>
>>>>> Here is the example:
>>>>>
>>>>> Ordering Constraints:
>>>>>   Resource Sets:
>>>>> set MYIP_1 MYIP_2 MYFTP MYIP_5 action=start sequential=false
>>>>> require-all=true set MYIP_3 MYIP_4 MYSMTP action=start sequential=true
>>>>> require-all=true setoptions symmetrical=false
>>>>> set MYSMTP MYIP_4 MYIP_3 action=stop sequential=true
>>>>> require-all=true set MYIP_5 MYFTP MYIP_2 MYIP_1 action=stop
>>>>> sequential=true require-all=true setoptions symmetrical=false
>>>>> kind=Mandatory
>>>>>
>>>>> The constraint works as expected on start, but when stopping, the resources
>>>>> don't respect the order.
>>>>> Any help is appreciated.
>>>>>
>>>>> Thanks and regards
>>>>> Ivan
>>>>
>>>> symmetrical=false means the order only applies for starting
>>>
>>> From the name (symmetrical) alone it could also mean that it only applies
>>> for stopping ;-)
>>> (Another example where better names would be nice)
>>
>> Well, more specifically, it only applies to the action specified in the
>> constraint. I hadn't noticed before that the second constraint here has
>> action=stop, so yes, that one would only apply for stopping.
>>
>> In the above example, the two constraints are identical to a single
>> constraint with symmetrical=true, since the second constraint is just
>> the reverse of the first.



Re: [ClusterLabs] Cannot clone clvmd resource

2017-03-01 Thread Ken Gaillot
On 03/01/2017 03:49 PM, Anne Nicolas wrote:
> Hi there
> 
> 
> I'm testing a fairly simple configuration to work with clvm. I'm
> going crazy because it seems clvmd cannot be cloned on other nodes.
> 
> clvmd starts fine on node1 but fails on both node2 and node3.

Your config looks fine, so I'm going to guess there's some local
difference on the nodes.

> In pacemaker journalctl I get the following message
> Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd:
> No such file or directory
> Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat
> /cmirrord: No such file or directory

I have no idea where the above is coming from. pidofproc is an LSB
function, but (given journalctl) I'm assuming you're using systemd. I
don't think anything in pacemaker or resource-agents uses pidofproc (at
least not currently, not sure about the older version you're using).

> Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd
> action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms
> Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok
> (node=node3, call=233, rc=0, cib-update=541, confirmed=true)
> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop
> p-dlm_stop_0 on node3 (local)
> Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm
> action:stop call_id:235
> Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop
> p-dlm_stop_0 on node2
> 
> Here is my configuration
> 
> node 739312139: node1
> node 739312140: node2
> node 739312141: node3
> primitive admin_addr IPaddr2 \
> params ip=172.17.2.10 \
> op monitor interval=10 timeout=20 \
> meta target-role=Started
> primitive p-clvmd ocf:lvm2:clvmd \
> op start timeout=90 interval=0 \
> op stop timeout=100 interval=0 \
> op monitor interval=30 timeout=90
> primitive p-dlm ocf:pacemaker:controld \
> op start timeout=90 interval=0 \
> op stop timeout=100 interval=0 \
> op monitor interval=60 timeout=90
> primitive stonith-sbd stonith:external/sbd
> group g-clvm p-dlm p-clvmd
> clone c-clvm g-clvm meta interleave=true
> property cib-bootstrap-options: \
> have-watchdog=true \
> dc-version=1.1.13-14.7-6f22ad7 \
> cluster-infrastructure=corosync \
> cluster-name=hacluster \
> stonith-enabled=true \
> placement-strategy=balanced \
> no-quorum-policy=freeze \
> last-lrm-refresh=1488404073
> rsc_defaults rsc-options: \
> resource-stickiness=1 \
> migration-threshold=10
> op_defaults op-options: \
> timeout=600 \
> record-pending=true
> 
> Thanks in advance for your input
> 
> Cheers
> 




Re: [ClusterLabs] Antw: Re: snapshots in a clvm environment - some questions for proceeding

2017-03-01 Thread Lentes, Bernd

> 
> Actually that's what we are doing, but at some point in the past I had ruined
> the directory with the VM images (user error). The problem then was that you
> need a running VM to restore files inside the VM. This is when you would like
> to have a crash-consistent backup image of your VM. However I found no working
> solution yet (we have VM images hosted on OCFS2, hosted on a cLVM LV).
> 
> Regards,
> Ulrich
> 
>> 

With OCFS2 you could snapshot the image file (I think they call it reflink).

Bernd 





[ClusterLabs] Cannot clone clvmd resource

2017-03-01 Thread Anne Nicolas
Hi there


I'm testing a fairly simple configuration to work with clvm. I'm
going crazy because it seems clvmd cannot be cloned on other nodes.

clvmd starts fine on node1 but fails on both node2 and node3.

In pacemaker journalctl I get the following message
Mar 01 16:34:36 node3 pidofproc[27391]: pidofproc: cannot stat /clvmd:
No such file or directory
Mar 01 16:34:36 node3 pidofproc[27392]: pidofproc: cannot stat
/cmirrord: No such file or directory
Mar 01 16:34:36 node3 lrmd[2174]: notice: finished - rsc:p-clvmd
action:stop call_id:233 pid:27384 exit-code:0 exec-time:45ms queue-time:0ms
Mar 01 16:34:36 node3 crmd[2177]: notice: Operation p-clvmd_stop_0: ok
(node=node3, call=233, rc=0, cib-update=541, confirmed=true)
Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 72: stop
p-dlm_stop_0 on node3 (local)
Mar 01 16:34:36 node3 lrmd[2174]: notice: executing - rsc:p-dlm
action:stop call_id:235
Mar 01 16:34:36 node3 crmd[2177]: notice: Initiating action 67: stop
p-dlm_stop_0 on node2

Here is my configuration

node 739312139: node1
node 739312140: node2
node 739312141: node3
primitive admin_addr IPaddr2 \
params ip=172.17.2.10 \
op monitor interval=10 timeout=20 \
meta target-role=Started
primitive p-clvmd ocf:lvm2:clvmd \
op start timeout=90 interval=0 \
op stop timeout=100 interval=0 \
op monitor interval=30 timeout=90
primitive p-dlm ocf:pacemaker:controld \
op start timeout=90 interval=0 \
op stop timeout=100 interval=0 \
op monitor interval=60 timeout=90
primitive stonith-sbd stonith:external/sbd
group g-clvm p-dlm p-clvmd
clone c-clvm g-clvm meta interleave=true
property cib-bootstrap-options: \
have-watchdog=true \
dc-version=1.1.13-14.7-6f22ad7 \
cluster-infrastructure=corosync \
cluster-name=hacluster \
stonith-enabled=true \
placement-strategy=balanced \
no-quorum-policy=freeze \
last-lrm-refresh=1488404073
rsc_defaults rsc-options: \
resource-stickiness=1 \
migration-threshold=10
op_defaults op-options: \
timeout=600 \
record-pending=true

Thanks in advance for your input

Cheers

-- 
Anne Nicolas
http://mageia.org



[ClusterLabs] R: Re: Antw: Re: Ordering Sets of Resources

2017-03-01 Thread iva...@libero.it
You are right, but I had to use the option symmetrical=false because I need to
be able to stop even a single primitive, while all resources are running, with
no impact on the other resources.

I have also used symmetrical=false with kind=Optional.
Stopping an individual resource then does not stop the other resources, but if
a list of primitives is given in no particular order during startup or
shutdown, the resources start or stop without strictly respecting the
constraint.

Regards
Ivan


>Original message
>From: "Ken Gaillot" 
>Date: 01/03/2017 15.57
>To: "Ulrich Windl", 
>Subject: Re: [ClusterLabs] Antw: Re:  Ordering Sets of Resources
>
>On 03/01/2017 01:36 AM, Ulrich Windl wrote:
>>>>> Ken Gaillot wrote on 26.02.2017 at 20:04 in message:
>>> On 02/25/2017 03:35 PM, iva...@libero.it wrote:
>>>> Hi all,
>>>> I have configured a two-node cluster on redhat 7.
>>>>
>>>> Because I need to stop and start resources individually while
>>>> they are running, I have configured the cluster using order set constraints.
>>>>
>>>> Here is the example:
>>>>
>>>> Ordering Constraints:
>>>>   Resource Sets:
>>>> set MYIP_1 MYIP_2 MYFTP MYIP_5 action=start sequential=false
>>>> require-all=true set MYIP_3 MYIP_4 MYSMTP action=start sequential=true
>>>> require-all=true setoptions symmetrical=false
>>>> set MYSMTP MYIP_4 MYIP_3 action=stop sequential=true
>>>> require-all=true set MYIP_5 MYFTP MYIP_2 MYIP_1 action=stop
>>>> sequential=true require-all=true setoptions symmetrical=false
>>>> kind=Mandatory
>>>>
>>>> The constraint works as expected on start, but when stopping, the resources
>>>> don't respect the order.
>>>> Any help is appreciated.
>>>>
>>>> Thanks and regards
>>>> Ivan
>>>
>>> symmetrical=false means the order only applies for starting
>> 
>> From the name (symmetrical) alone it could also mean that it only applies
>> for stopping ;-)
>> (Another example where better names would be nice)
>
>Well, more specifically, it only applies to the action specified in the
>constraint. I hadn't noticed before that the second constraint here has
>action=stop, so yes, that one would only apply for stopping.
>
>In the above example, the two constraints are identical to a single
>constraint with symmetrical=true, since the second constraint is just
>the reverse of the first.
>
>
>





[ClusterLabs] Expected recovery behavior of remote-node guest when corosync ring0 is lost in a passive mode RRP config?

2017-03-01 Thread Scott Greenlese

Hi..

I am running a few corosync "passive mode" Redundant Ring Protocol (RRP)
failure scenarios, where my cluster has several remote-node VirtualDomain
resources running on each node in the cluster, which have been configured
to allow Live Guest Migration (LGM) operations.

While both corosync rings are active, if I drop ring0 on a given node where
I have remote nodes (guests) running, I noticed that the guest is shut down
and restarted on the same host, after which the connection is re-established
and the guest proceeds to run on that same cluster node.

I am wondering why pacemaker doesn't try to "live" migrate the remote node
(guest) to a different node instead of rebooting the guest. Is there some way
to configure the remote nodes such that the recovery action is LGM instead of
reboot when the host-to-remote-node connection is lost in an RRP situation?
I guess the next question is: is it even possible to LGM a remote node guest
if the corosync ring fails over from ring0 to ring1 (or vice versa)?

# For example, here's a remote node's VirtualDomain resource definition.

[root@zs95kj]# pcs resource show  zs95kjg110102_res
 Resource: zs95kjg110102_res (class=ocf provider=heartbeat
type=VirtualDomain)
  Attributes: config=/guestxml/nfs1/zs95kjg110102.xml
hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true remote-node=zs95kjg110102
remote-addr=10.20.110.102
  Operations: start interval=0s timeout=480
(zs95kjg110102_res-start-interval-0s)
  stop interval=0s timeout=120
(zs95kjg110102_res-stop-interval-0s)
  monitor interval=30s (zs95kjg110102_res-monitor-interval-30s)
  migrate-from interval=0s timeout=1200
(zs95kjg110102_res-migrate-from-interval-0s)
  migrate-to interval=0s timeout=1200
(zs95kjg110102_res-migrate-to-interval-0s)
[root@zs95kj VD]#




# My RRP rings are active, and configured with "rrp_mode: passive"

[root@zs95kj ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id  = 10.20.93.12
status  = ring 0 active with no faults
RING ID 1
id  = 10.20.94.212
status  = ring 1 active with no faults
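
For repeated failure tests it can help to check the ring state from a script. The following is a small sketch that parses output in the shape shown above; the helper names are made up for illustration, and the only status wording it relies on is the "no faults" text visible in this output.

```python
import re


def parse_ring_status(text):
    """Parse 'corosync-cfgtool -s' style output into {ring_id: {id, status}}."""
    rings = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        ring = re.match(r"RING ID (\d+)", line)
        if ring:
            current = int(ring.group(1))
            rings[current] = {}
            continue
        field = re.match(r"(id|status)\s*=\s*(.*)", line)
        if field and current is not None:
            rings[current][field.group(1)] = field.group(2)
    return rings


def rings_with_faults(rings):
    """Return the ring ids whose status line does not say 'no faults'."""
    return [
        ring_id
        for ring_id, info in sorted(rings.items())
        if "no faults" not in info.get("status", "")
    ]


SAMPLE = """Printing ring status.
Local node ID 2
RING ID 0
id  = 10.20.93.12
status  = ring 0 active with no faults
RING ID 1
id  = 10.20.94.212
status  = ring 1 active with no faults
"""

print(rings_with_faults(parse_ring_status(SAMPLE)))
```

In a real test the sample text would be replaced by the captured output of the command on the node under test.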



# Here's the corosync.conf ..

[root@zs95kj ~]# cat /etc/corosync/corosync.conf
totem {
version: 2
secauth: off
cluster_name: test_cluster_2
transport: udpu
rrp_mode: passive
}

nodelist {
node {
ring0_addr: zs95kjpcs1
ring1_addr: zs95kjpcs2
nodeid: 2
}

node {
ring0_addr: zs95KLpcs1
ring1_addr: zs95KLpcs2
nodeid: 3
}

node {
ring0_addr: zs90kppcs1
ring1_addr: zs90kppcs2
nodeid: 4
}

node {
ring0_addr: zs93KLpcs1
ring1_addr: zs93KLpcs2
nodeid: 5
}

node {
ring0_addr: zs93kjpcs1
ring1_addr: zs93kjpcs2
nodeid: 1
}
}

quorum {
provider: corosync_votequorum
}

logging {
to_logfile: yes
logfile: /var/log/corosync/corosync.log
timestamp: on
syslog_facility: daemon
to_syslog: yes
debug: on

logger_subsys {
debug: off
subsys: QUORUM
}
}




# Here's the vlan / route situation on cluster node zs95kj:

ring0 is on vlan1293
ring1 is on vlan1294

[root@zs95kj ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref Use Iface
0.0.0.0         10.20.93.254    0.0.0.0         UG    400    0   0   vlan1293  << default route to guests from ring0
9.0.0.0         9.12.23.1       255.0.0.0       UG    400    0   0   vlan508
9.12.23.0       0.0.0.0         255.255.255.0   U     400    0   0   vlan508
10.20.92.0      0.0.0.0         255.255.255.0   U     400    0   0   vlan1292
10.20.93.0      0.0.0.0         255.255.255.0   U     0      0   0   vlan1293  << ring0 IPs
10.20.93.0      0.0.0.0         255.255.255.0   U     400    0   0   vlan1293
10.20.94.0      0.0.0.0         255.255.255.0   U     0      0   0   vlan1294  << ring1 IPs
10.20.94.0      0.0.0.0         255.255.255.0   U     400    0   0   vlan1294
10.20.101.0     0.0.0.0         255.255.255.0   U     400    0   0   vlan1298
10.20.109.0     10.20.94.254    255.255.255.0   UG    400    0   0   vlan1294  << Route to guests on 10.20.109 from ring1
10.20.110.0     10.20.94.254    255.255.255.0   UG    400    0   0   vlan1294  << Route to guests on 10.20.110 from ring1
169.254.0.0     0.0.0.0         255.255.0.0     U     1007   0   0   enccw0.0.02e0
169.254.0.0     0.0.0.0         255.255.0.0     U     1016   0   0   ovsbridge1
192.168.122.0   0.0.0.0         255.255.255.0   U     0      0   0   virbr0



# On remote node, you can see we have a connection back to the host.

Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info:
crm_log_init:  Changed active directory to /var/lib/heartbeat/cores/root
Feb 28 14:30:22 [928] zs95kjg110102 pacemaker_remoted: info:
qb_ipcs_us_publish:server name: lrmd

Re: [ClusterLabs] Antw: Re: Ordering Sets of Resources

2017-03-01 Thread Ken Gaillot
On 03/01/2017 01:36 AM, Ulrich Windl wrote:
>>>> Ken Gaillot wrote on 26.02.2017 at 20:04 in message:
>> On 02/25/2017 03:35 PM, iva...@libero.it wrote:
>>> Hi all,
>>> I have configured a two-node cluster on redhat 7.
>>>
>>> Because I need to stop and start resources individually while
>>> they are running, I have configured the cluster using order set constraints.
>>>
>>> Here the example
>>>
>>> Ordering Constraints:
>>>   Resource Sets:
>>> set MYIP_1 MYIP_2 MYFTP MYIP_5 action=start sequential=false
>>> require-all=true set MYIP_3 MYIP_4 MYSMTP action=start sequential=true
>>> require-all=true setoptions symmetrical=false
>>> set MYSMTP MYIP_4 MYIP_3 action=stop sequential=true
>>> require-all=true set MYIP_5 MYFTP MYIP_2 MYIP_1 action=stop
>>> sequential=true require-all=true setoptions symmetrical=false kind=Mandatory
>>>
>>> The constraint works as expected on start, but when stopping, the resources
>>> don't respect the order.
>>> Any help is appreciated.
>>>
>>> Thanks and regards
>>> Ivan
>>
>> symmetrical=false means the order only applies for starting
> 
> From the name (symmetrical) alone it could also mean that it only applies for 
> stopping ;-)
> (Another example where better names would be nice)

Well, more specifically, it only applies to the action specified in the
constraint. I hadn't noticed before that the second constraint here has
action=stop, so yes, that one would only apply for stopping.

In the above example, the two constraints are identical to a single
constraint with symmetrical=true, since the second constraint is just
the reverse of the first.




Re: [ClusterLabs] Antw: using IPMI for fencing - configuring IPMI with ipmitool - SOLVED

2017-03-01 Thread Lentes, Bernd


- On Mar 1, 2017, at 1:41 PM, Bernd Lentes 
bernd.len...@helmholtz-muenchen.de wrote:


Hi,

I got it working:
stonith -t external/ipmi hostname=ha-idg-1 ipaddr=146.107.235.15 userid=root \
passwd=xx passwd_method=param interface=lanplus -S
info: external/ipmi device OK.


I had to use IPMI v2.0, which means connecting through the lanplus interface. IPMI
v1.5 is deactivated, so trying to connect via the lan interface cannot succeed.

ha-idg-1:~ # ipmitool channel authcap 2 4
Channel number             : 2
IPMI v1.5  auth types      :  <==
KG status                  : default (all zeroes)
Per message authentication : disabled
User level authentication  : enabled
Non-null user names exist  : yes
Null user names exist      : no
Anonymous login enabled    : no
Channel supports IPMI v1.5 : no   <=
Channel supports IPMI v2.0 : yes
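
The choice between the lan and lanplus interfaces can be read off the authcap
output shown above. The following is only a minimal sketch and not part of the
original post: the function name pick_ipmi_interface is made up for
illustration, and preferring lanplus whenever IPMI v2.0 is offered is an
assumption.

```shell
# Hypothetical helper: read "ipmitool channel authcap <channel> <priv>"
# output on stdin and print the interface name to use.
pick_ipmi_interface() {
    awk -F: '
        # Keep only the yes/no answer; drop padding and any "<=" annotation.
        /Channel supports IPMI v1\.5/ { gsub(/[^a-z]/, "", $2); v15 = $2 }
        /Channel supports IPMI v2\.0/ { gsub(/[^a-z]/, "", $2); v20 = $2 }
        END {
            if (v20 == "yes")      print "lanplus"   # IPMI v2.0 (RMCP+)
            else if (v15 == "yes") print "lan"       # IPMI v1.5 only
            else                   exit 1            # neither is offered
        }'
}

# Possible use on the node (channel 2, privilege level 4 as above):
#   ipmitool channel authcap 2 4 | pick_ipmi_interface
# The result can then be passed to ipmitool's -I option, for example:
#   ipmitool -I lanplus -H 146.107.235.15 -U root -P <password> chassis power status
```

For the output above the function prints "lanplus", which matches the
interface=lanplus parameter given to the stonith plugin.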


Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671



Re: [ClusterLabs] Insert delay between the statup of VirtualDomain

2017-03-01 Thread Oscar Segarra
Hi Dejan,

In my environment, it is possible to launch the check from the hypervisor.
A simple telnet against a specific port may be enough to check whether the
service is ready.

In this simple scenario (and check), how can I instruct the second server to
wait until the MySQL server is up?
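
A port check of this kind could be sketched as follows. This is only an
illustration under assumptions: the function name check_port, the address
192.0.2.10 and port 3306 are placeholders, and the sketch relies on bash's
/dev/tcp feature and the coreutils timeout command being available on the
hypervisor.

```shell
# Hypothetical helper: succeed if HOST accepts TCP connections on PORT
# within WAIT seconds (default 3).
check_port() {
    host=$1
    port=$2
    wait=${3:-3}
    # /dev/tcp is a bash feature, so bash is invoked explicitly;
    # "timeout" stops the attempt if the guest does not answer at all.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# A small script listed in the VirtualDomain monitor_scripts parameter
# could then consist of a single call, for example:
#   check_port 192.0.2.10 3306 || exit 1
```

The function returns non-zero both when the connection is refused and when
the attempt times out, so the caller only needs to look at the exit status.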

Thanks a lot

On Mar 1, 2017 at 1:08 PM, "Dejan Muhamedagic" wrote:

> Hi,
>
> On Sat, Feb 25, 2017 at 09:58:01PM +0100, Oscar Segarra wrote:
> > Hi,
> >
> > Yes,
> >
> > Database server can be considered started up when it accepts mysql client
> > connections
> > Applications server can be considered started as soon as the listening port
> > is up and accepting connections
> >
> > Can you provide an example of how to achieve this?
>
> Is it possible to connect to the database from the hypervisor?
> Then something like this would do:
>
> mysql -h vm_ip_address ... < /dev/null
>
> If not, then if ssh works:
>
> echo mysql ... | ssh vm_ip_address
>
> I'm afraid I cannot help you more with mysql details and what to
> put in place of '...' above, but it should do whatever is necessary
> to test if the database reached the functional state. You can
> find an example in ocf:heartbeat:mysql: just look for the
> "test_table" parameter. Of course, you'll need to put that in a
> script and test output and so on. I guess that there's enough
> information on the Internet on how to do that.
>
> Good luck!
>
> Dejan
>
> > Thanks a lot.
> >
> >
> > 2017-02-25 19:35 GMT+01:00 Dejan Muhamedagic :
> >
> > > Hi,
> > >
> > > On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote:
> > > > Hi,
> > > >
> > > > In my environment I have 5 guests that have to be started up in a
> > > > specified order, starting with the MySQL database server.
> > > >
> > > > I have set the order constraints and VirtualDomains start in the right
> > > > order but, the problem I have, is that the second host starts up faster
> > > > than the database server and therefore applications running on the second
> > > > host raise errors due to database connectivity problems.
> > > >
> > > > I'd like to introduce a delay between the startup of the VirtualDomain of
> > > > the database server and the startup of the second guest.
> > >
> > > Do you have a way to check if this server is up? If so...
> > > The start action of VirtualDomain won't exit until the monitor
> > > action returns success. And there's a parameter called
> > > monitor_scripts (see the meta-data). Note that these programs
> > > (scripts) are run on the hypervisor host and not in the guest.
> > > It's all a bit involved, but should be doable.
> > >
> > > Thanks,
> > >
> > > Dejan
> > >
> > > > Is there any way to achieve this?
> > > >
> > > > Thanks a lot.

Re: [ClusterLabs] Antw: using IPMI for fencing - configuring IPMI with ipmitool - HELP

2017-03-01 Thread Lentes, Bernd

- On Mar 1, 2017, at 10:20 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:


>> Hi,
>> 
>> I have an HP server ML 350 G9 with an ILO4 card. The riloe stonith agent does
>> not work; I read in a book the recommendation to use the ipmi resource agent
>> instead.
> 
> Why don't you use SBD (as recommended)?
> 

Hi,

I want to have two independent fencing methods.


Bernd
 


Re: [ClusterLabs] Antw: Re: snapshots in a clvm environment - some questions for proceeding

2017-03-01 Thread Lentes, Bernd

- On Mar 1, 2017, at 8:11 AM, Ulrich Windl 
ulrich.wi...@rz.uni-regensburg.de wrote:

>>> Digimer wrote on 24.02.2017 at 19:20 in message
> <8762afa9-0f45-04c7-2404-565dcabf9...@alteeve.ca>:
> 
> [...]
>> Aside from this, I strongly recommend against snapshots as a backup
>> mechanism anyway. There is no way to ensure that the operating system
>> and applications are in a clean state when you take the snapshot, so
>> using the image is like recovering from sudden power loss. If data was
>> in cache but not flushed out, you could have corruption.
>> 
>> If you can't stop your VMs, I'd recommend using a backup application
>> inside the VM that knows how to ensure that your apps and the OS are in
>> a clean state, particularly for DBs.
> 
> Actually that's what we are doing, but at some point in the past I had ruined
> the directory with the VM images (user error). The problem then was that you
> need a running VM to restore files inside the VM. This is when you would like
> to have a crash-consistent backup image of your VM. However I found no working
> solution yet (we have VM images hosted on OCFS2, hosted on a cLVM LV).
> 

I changed the approach. Inside the VM I also have LVs. When changing the
configuration inside the VM, I will take the snapshot inside it.


Bernd
 


Re: [ClusterLabs] using IPMI for fencing - configuring IPMI with ipmitool - HELP

2017-03-01 Thread Lentes, Bernd


- On Mar 1, 2017, at 5:03 AM, Andrei Borzenkov arvidj...@gmail.com wrote:

> 28.02.2017 20:39, Lentes, Bernd wrote:
>> Hi,
>> 
>> I have an HP server ML 350 G9 with an ILO4 card. The riloe stonith agent does
>> not work; I read in a book the recommendation to use the ipmi resource agent
>> instead.
>> I'm trying to configure the respective ILO adapter with ipmitool.
> 
> Why don't you simply go to the iLO4 web interface and configure users there?

I have users configured there. But are they visible to IPMI?


Bernd
 


Re: [ClusterLabs] Oralsnr/Oracle resources agents

2017-03-01 Thread Dejan Muhamedagic
Hi,

On Sun, Feb 26, 2017 at 09:51:47AM +0300, Andrei Borzenkov wrote:
> 25.02.2017 23:18, Jihed M'selmi wrote:
> > [DM] I thought that oracle listener is not consuming that many resources.
> > At any rate, ocf:heartbeat:oralsnr doesn't support single listener for
> > multiple instances. Do you have an idea how to do that? How to deal with
> > the tnsping then? Maybe you're better off with the system start script in
> > this case.
> > 
> > [JM] According to the DBA, it could lead to memory issues when the
> > listener serves many instances at the same time (in my experience, I have
> > never faced this issue).
> > 
> 
> What "it" means in the above sentence? "Running single listener for
> multiple instances" or "running each instance with own listener"?
> 
> How many instances are we talking about?
> 
> > Let's take a case where the listener is serving multiple instances and one
> > of the instances fails => ocf:heartbeat:oracle will relocate it to another
> > node, and the listener should follow (especially when we use a
> > colocation constraint between RA oracle and oralsnr); this will have a bad
> > impact on the rest of the instances.
> > 
> > One option is to have two listeners (one per node), configured
> > outside the cluster, to host all the instances. But I keep looking for a
> > better solution.
> > 
> > [DM] Hmm, what should then the RA do? Skip the instance and report it 
> > started?
> > I'm not sure I follow.
> > [JM] The DBA uses a Y/N flag to tell whether this instance should run or
> > not. It would be better for the RA to use this flag too: when it's Y, start
> > the instance; when it's N, the RA should not start the instance, and a
> > suitable message in the log would be useful to describe the situation. Now,
> > the challenge is how to monitor this flag.
> > 
> 
> DBA still needs to remember to change this flag on each node in the
> cluster. In which case it can just as well remember to use different way
> to disable automatic startup.

Indeed. I wonder what is the difference between editing a file
and running say "crm rsc stop db".

At any rate, I doubt that there is a sane way for the RA to
handle such a case.

> > One of the issues that I faced was when the DBA wanted to shut down the
> > listener and the instance (to launch the cold backup), but the RA kept
> > pushing them ON. -- Note that the DBA team usually doesn't have access to
> > pcs to disable the resource during this type of operation.
> > 
> 
> There is nothing new. Once you put application under HA control, you in
> general cannot use native application tools to manage it. That is why
> SAP introduced "cluster glue" layer that intercepts native requests to
> start/stop application and forwards them to cluster for actual processing.
> 
> The solution here is really to make it possible to delegate control of
> individual resources to different users, so that DBA can
> start/stop/disable/unmanage individual resources (s)he owns.
> 
> If nothing else, it can be done as SUID scripts that implement this check.

Users other than root/hacluster can be given access to the cluster
(essentially editing the CIB). Details escape me now, but there
is a concept of roles, and one can then define rules on what the roles
are allowed to do.
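
With pcs, such a role could be set up roughly as sketched below. This is an
illustration under assumptions, not a tested recipe: the helper names, the
user name "oracle" and the resource id "db" are made up, the exact ACL syntax
should be checked against the installed pcs version, and the user normally
also has to be a member of the haclient group. The DRY_RUN switch only prints
the commands instead of running them.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: let USER read the CIB and modify one resource.
grant_dba_access() {
    user=$1
    resource=$2
    # Role: read access to the whole CIB, write access to one resource id.
    run pcs acl role create "${user}-role" \
        read xpath /cib \
        write id "$resource"
    # Bind the role to the (existing) system user.
    run pcs acl user create "$user" "${user}-role"
    # ACLs have no effect until they are enabled cluster-wide.
    run pcs acl enable
}

# Preview the commands without touching the cluster:
#   DRY_RUN=1 grant_dba_access oracle db
```

Write access to the resource definition is what allows the DBA to disable or
unmanage that resource without being able to change the rest of the cluster
configuration.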

Thanks,

Dejan


Re: [ClusterLabs] Insert delay between the statup of VirtualDomain

2017-03-01 Thread Dejan Muhamedagic
Hi,

On Mon, Feb 27, 2017 at 12:38:07PM +0100, Ferenc Wágner wrote:
> Oscar Segarra  writes:
> 
> > In my environment I have 5 guests that have to be started up in a
> > specified order, starting with the MySQL database server.
> 
> We use a somewhat redesigned resource agent, which connects to the guest
> using a virtio channel and waits for a signal before exiting from the
> start operation.  The signal is sent by an appropriately placed startup
> script from the guest.  This is fully independent from regular network
> traffic and does not need any channel configuration.

Cool. Maybe you'd like to share the code or, best, do a pull
request at github. This is certainly very useful.

Thanks,

Dejan

> -- 
> Feri
> 

Re: [ClusterLabs] Insert delay between the statup of VirtualDomain

2017-03-01 Thread Dejan Muhamedagic
Hi,

On Sat, Feb 25, 2017 at 09:58:01PM +0100, Oscar Segarra wrote:
> Hi,
> 
> Yes,
> 
> Database server can be considered started up when it accepts mysql client
> connections
> Applications server can be considered started as soon as the listening port
> is up and accepting connections
> 
> Can you provide an example of how to achieve this?

Is it possible to connect to the database from the hypervisor?
Then something like this would do:

mysql -h vm_ip_address ... < /dev/null

If not, then if ssh works:

echo mysql ... | ssh vm_ip_address

I'm afraid I cannot help you more with mysql details and what to
put in place of '...' above, but it should do whatever is necessary
to test if the database reached the functional state. You can
find an example in ocf:heartbeat:mysql: just look for the
"test_table" parameter. Of course, you'll need to put that in a
script and test output and so on. I guess that there's enough
information on the Internet on how to do that.
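
As a rough sketch of such a script, assuming the mysql client is installed on
the hypervisor and credentials come from the client's own option file: the
function name db_ready, the MYSQL_BIN variable and the trivial "SELECT 1"
query are illustrative choices, not something taken from the resource agent.

```shell
# Client binary; overridable so the function can be tried without a server.
MYSQL_BIN=${MYSQL_BIN:-mysql}

# Hypothetical helper: succeed only if the server on HOST answers a query.
db_ready() {
    host=$1
    # A trivial query proves more than an open port: the server has to
    # accept the login and execute a statement.
    "$MYSQL_BIN" -h "$host" --connect-timeout=5 -e 'SELECT 1' \
        </dev/null >/dev/null 2>&1
}

# A monitor script could then be as short as:
#   db_ready vm_ip_address || exit 1
```

A query against a real table, as the "test_table" parameter does, would be a
stricter test; the exit status is all the caller needs either way.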

Good luck!

Dejan

> Thanks a lot.
> 
> 
> 2017-02-25 19:35 GMT+01:00 Dejan Muhamedagic :
> 
> > Hi,
> >
> > On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote:
> > > Hi,
> > >
> > > In my environment I have 5 guests that have to be started up in a
> > > specified order, starting with the MySQL database server.
> > >
> > > I have set the order constraints and VirtualDomains start in the right
> > > order but, the problem I have, is that the second host starts up faster
> > > than the database server and therefore applications running on the second
> > > host raise errors due to database connectivity problems.
> > >
> > > I'd like to introduce a delay between the startup of the VirtualDomain of
> > > the database server and the startup of the second guest.
> >
> > Do you have a way to check if this server is up? If so...
> > The start action of VirtualDomain won't exit until the monitor
> > action returns success. And there's a parameter called
> > monitor_scripts (see the meta-data). Note that these programs
> > (scripts) are run on the hypervisor host and not in the guest.
> > It's all a bit involved, but should be doable.
> >
> > Thanks,
> >
> > Dejan
> >
> > > Is there any way to achieve this?
> > >
> > > Thanks a lot.

Re: [ClusterLabs] Never join a list without a problem...

2017-03-01 Thread Ferenc Wágner
Jeffrey Westgate  writes:

> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
> longer, and we cannot set a clock by it - while the machine is 95%
> idle (or more according to 'top'), the host load shoots up to 50 or
> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
> to come back down to baseline, which is mostly 0.00.  (attached
> hostload.pdf) This happens to both machines, randomly, and is
> concerning, as we'd like to find what's causing it and resolve it.

Try running atop (http://www.atoptool.nl/).  It collects and logs
process accounting info, allowing you to step back in time and check
resource usage in the past.
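
Stepping back in time with atop could look like the sketch below. The log
directory /var/log/atop and the atop_YYYYMMDD file naming are assumptions
(they are common packaging defaults but may differ on a given system), and
the helper name atop_log_for is made up for illustration.

```shell
# Directory where the atop daemon writes its daily logs (assumed default).
ATOP_LOGDIR=${ATOP_LOGDIR:-/var/log/atop}

# Hypothetical helper: map a date given as YYYY-MM-DD to that day's log file.
atop_log_for() {
    printf '%s/atop_%s\n' "$ATOP_LOGDIR" "$(printf '%s' "$1" | tr -d '-')"
}

# Replay the recorded samples around a load peak, for example between
# 01:00 and 02:30 on 27 Feb 2017:
#   atop -r "$(atop_log_for 2017-02-27)" -b 01:00 -e 02:30
```

Inside the replay, "t" steps forward and "T" steps back one sample, which
makes it possible to see which processes were active while the load rose.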
-- 
Feri


[ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-01 Thread Ulrich Windl
>>> Kai Dupke wrote on 01.03.2017 at 09:55 in message:
> On 02/27/2017 02:26 PM, Jeffrey Westgate  wrote:
>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer,
>> and we cannot set a clock by it - while the machine is 95% idle (or more
>> according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20
>> minutes to peak, and another 30 to 45 minutes to come back down to baseline,
>> which is mostly 0.00.
> 
> So, you have a time window of ~1h where the system is under load, right?
> This is somewhat different to what Ulrich had, but his approach might be
> useful for you, too.
> 
> Is there anything against running some monitoring that captures the
> processes, process states and load, say, every 5 minutes?
> 
> Of course, the peaks might correlate to something in the logs - like
> cron, logins, logrotates or whatever.

The main issue is "expected load" vs. "unexpected load". In my case the system 
was expected to be completely idle at night, so I had set the thresholds rather 
low. Other systems can use different approaches. I hope to hear what caused the 
problem in your case.

Ulrich





[ClusterLabs] Antw: using IPMI for fencing - configuring IPMI with ipmitool - HELP

2017-03-01 Thread Ulrich Windl
>>> "Lentes, Bernd" wrote on 28.02.2017 at 18:39 in message
<476524732.41182296.1488303562492.javamail.zim...@helmholtz-muenchen.de>:
> Hi,
> 
> I have an HP server ML 350 G9 with an ILO4 card. The riloe stonith agent does
> not work; I read in a book the recommendation to use the ipmi resource agent
> instead.

Why don't you use SBD (as recommended)?

> I'm trying to configure the respective ILO adapter with ipmitool. OMG. 
> Ipmitool drives me crazy.
> It's a SLES 11 SP4 node. I did "/etc/init.d/ipmi start", some modules are 
> loaded:
> 
> ha-idg-1:~ # lsmod|grep -i ipmi
> ipmi_devintf   17560  0
> ipmi_si53422  0
> ipmi_msghandler49979  2 ipmi_devintf,ipmi_si
> 
> I have a device file:
> 
> ha-idg-1:~ # ll /dev/ipm*
> crw-rw 1 root root 246, 0 Feb 28 13:51 /dev/ipmi0
> 
> What i found out/did already:
> 
> For channel 2 i have two users configured:
> 
> ipmitool> user list 2
> 1   Administrator    true    false      true       ADMINISTRATOR
> 2   root             true    false      true       ADMINISTRATOR
> 3   (Empty User)     true    false      false      NO ACCESS
> 4   (Empty User)     true    false      false      NO ACCESS
> 5   (Empty User)     true    false      false      NO ACCESS
> 6   (Empty User)     true    false      false      NO ACCESS
> 7   (Empty User)     true    false      false      NO ACCESS
> 8   (Empty User)     true    false      false      NO ACCESS
> 9   (Empty User)     true    false      false      NO ACCESS
> 10  (Empty User)     true    false      false      NO ACCESS
> 11  (Empty User)     true    false      false      NO ACCESS
> 12  (Empty User)     true    false      false      NO ACCESS
> 
> User root has a password which I tested via "user test" and it was ok.
> 
> Channel 2:
> 
> ipmitool> channel info 2
> Channel 0x2 info:
>   Channel Medium Type   : 802.3 LAN
>   Channel Protocol Type : IPMB-1.0
>   Session Support   : multi-session
>   Active Session Count  : 0
>   Protocol Vendor ID: 7154
>   Volatile(active) Settings
> Alerting: enabled
> Per-message Auth: disabled
> User Level Auth : enabled
> Access Mode : always available
>   Non-Volatile Settings
> Alerting: enabled
> Per-message Auth: disabled
> User Level Auth : enabled
> Access Mode : always available
> 
> ipmitool> lan print 2
> Set in Progress : Set Complete
> Auth Type Support   :
> Auth Type Enable: Callback :
> : User :
> : Operator :
> : Admin:
> : OEM  :
> IP Address Source   : DHCP Address
> IP Address  : 146.107.235.15
> Subnet Mask : 255.255.255.0
> MAC Address : 70:10:6f:47:0c:48
> SNMP Community String   :
> BMC ARP Control : ARP Responses Enabled, Gratuitous ARP Disabled
> Default Gateway IP  : 146.107.235.1
> 802.1q VLAN ID  : Disabled
> 802.1q VLAN Priority: 0
> RMCP+ Cipher Suites : 0,1,2,3
> Cipher Suite Priv Max   : XuuaXXX
> : X=Cipher Suite Unused
> : c=CALLBACK
> : u=USER
> : o=OPERATOR
> : a=ADMIN
> : O=OEM
> 
> How can i grant principal access to channel 2 ?
> I tried:
> 
> ipmitool> lan set 2 access on
> Set Channel Access for channel 2 failed: Unknown (0x83)
> ipmitool> lan set 2 access ON
> lan set <channel> access <on|off>
> ipmitool> lan set 2 access=ON
> lan set <channel> access <on|off>
> 
> Does not seem to work.
> 
> I did "lan set user 2", do not know if it's helpful.
> 
> Also:
> 
> ipmitool> channel authcap 2 4
> Channel number : 2
> IPMI v1.5  auth types  :
> KG status  : default (all zeroes)
> Per message authentication : disabled
> User level authentication  : enabled
> Non-null user names exist  : yes
> Null user names exist  : no
> Anonymous login enabled: no
> Channel supports IPMI v1.5 : no
> Channel supports IPMI v2.0 : yes
> 
> Don't know if it helps.
> 
> I found
> https://www.thomas-krenn.com/de/wiki/IPMI_Konfiguration_unter_Linux_mittels_ipmitool
> (sorry, only in German):
> 
> I did, as proposed:
> 
> ha-idg-1:~ # ipmitool lan set 2 auth ADMIN MD5
> ha-idg-1:~ # ipmitool lan set 2 access on
> Set Channel Access for channel 2 failed: Unknown (0x83)   <= ???
> 
> ha-idg-1:~ # ipmitool lan print 2
> Set in Progress : Set Complete
> Auth Type Support   :
> Auth Type Enable: Callback :
> : User :
> : Operator :
> : Admin:
> : OEM  :
> IP Address Source   : DHCP Address
> IP Address  : 146.107.235.15
> Subnet Mask : 255.255.255.0
> MAC 

Re: [ClusterLabs] Never join a list without a problem...

2017-03-01 Thread Kai Dupke
On 02/27/2017 02:26 PM, Jeffrey Westgate  wrote:
> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer, 
> and we cannot set a clock by it - while the machine is 95% idle (or more 
> according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20 
> minutes to peak, and another 30 to 45 minutes to come back down to baseline, 
> which is mostly 0.00.

So, you have a time window of ~1h where the system is under load, right?
This is somewhat different to what Ulrich had, but his approach might be
useful for you, too.

Is there anything against running some monitoring that captures the
processes, process states and load, say, every 5 minutes?

Of course, the peaks might correlate to something in the logs - like
cron, logins, logrotates or whatever.

regards,
Kai Dupke
Senior Product Manager
SUSE Linux Enterprise 13
-- 
Sell not virtue to purchase wealth, nor liberty to purchase power.
Phone:  +49-(0)5102-9310828 Mail: kdu...@suse.com
Mobile: +49-(0)173-5876766  WWW:  www.suse.com

SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)


[ClusterLabs] Antw: Re: Never join a list without a problem...

2017-03-01 Thread Ulrich Windl
>>> Jeffrey Westgate wrote on 27.02.2017 at 14:26 in message:
> Thanks, Ken. 
> 
> Our late guru was the admin who set all this up, and it's been rock solid 
> until recent oddities started cropping up.  They still function fine - 
> they've 
> just developed some... quirks.
> 
> I found the solution before I got your reply, which was essentially what we 
> did; update all but pacemaker, reboot, stop pacemaker, update pacemaker, 
> reboot.  That process was necessary because they've been running sooo long, 
> pacemaker would not stop.  it would try, then seemingly stall after several 
> minutes.
> 
> We're good now, up-to-date-wise, and stuck only with the initial issue we 
> were 
> hoping to eliminate by updating/patching EVERYthing.  And we honestly don't 
> know what may be causing it.
> 
> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer, 
> and we cannot set a clock by it - while the machine is 95% idle (or more 
> according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20 
> minutes to peak, and another 30 to 45 minutes to come back down to baseline, 
> which is mostly 0.00.  (attached hostload.pdf)  This happens to both 
> machines, randomly, and is concerning, as we'd like to find what's causing it 
> and resolve it.

We use SLES11 here, and it took me a really long time to find out what was 
causing nightly load peaks on our servers. It turned out to be the rebuild of 
the manual database (mandb). It didn't show in Nagios load statistics, but in 
monit alerts (on some machines we use both). In monit you can run a script when 
some condition is met. So I constructed a "capture script" to find the guilty 
parties ;-)
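
A capture script of this kind might look like the following sketch. It is not
the script used here, only an illustration: the function name
capture_snapshot and the log path are made up, and the choice of what to
record (timestamp, load average, the top of a batch-mode "top" listing) is an
assumption.

```shell
# Hypothetical helper: append one snapshot of the system state to FILE.
capture_snapshot() {
    out=$1
    {
        date
        # Load averages as the kernel reports them (Linux only).
        cat /proc/loadavg 2>/dev/null
        # One batch-mode iteration of top, limited to the busiest entries.
        if command -v top >/dev/null 2>&1; then
            top -b -n 1 | head -n 20
        fi
    } >>"$out" 2>&1
}

# Called from the monitoring tool when the limit matches, or from cron:
#   capture_snapshot /var/log/load-capture.log
```

Because short peaks are easy to miss, appending to one file and comparing
several snapshots over a few nights is more useful than a single capture.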

However the peaks were so short that it took many runs to find it. Here the 
load was back to normal already, but monit had reported an event like "cpu 
system usage of 30.2% matches resource limit [cpu system usage>20.0%]":

Sat May 11 01:31:13 CEST 2013
top - 01:31:14 up 2 days,  9:31,  0 users,  load average: 0.91, 0.31, 0.15
Tasks: 114 total,   2 running, 112 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1065628k total,  1055292k used,10336k free,   143708k buffers
Swap:  2097148k total,0k used,  2097148k free,   578736k cached

  PID USER  PR  NI  VIRT  RES  SHR S   %CPU %MEMTIME+  COMMAND
 2832 root  20   0  8916 1060  776 R  0  0.1   0:00.00 top
 2910 man   30  10 840 R  0  0.0   0:00.00 mandb

Maybe this helps.

Regards,
Ulrich

> 
> We were hoping "uptime kernel bug", but patching has not helped.  There 
> seems to be no increase in the number of processes running, and the processes 
> running do not take any more cpu time.  They are DNS forwarding resolvers, 
> but there is no correlation between dns requests and load increase - 
> sometimes 
> (like this morning) it rises around 1 AM when the dns load is minimal.
> 
> The oddity is - these are the only two boxes with this issue, and we have a 
> couple dozen at the same OS and level.  Only these two, with this role and 
> this particular package set have the issue.
> 
> --
> Jeff




