Re: [Linux-HA] file system resource becomes inaccessible when any of the nodes goes down
When a node goes down it first shows up in the unclean state, as you can see in your logs: corosync forms a new configuration, a stonith reboot request is issued, and because you are using sbd the node only becomes offline once msgwait has expired. When msgwait expires pacemaker knows the node is dead and the stonith -> dlm -> ocfs2 chain can recover, so the filesystem becomes usable again. If you need to reduce msgwait, be careful about the other timeouts or you will trade one cluster problem for another.

2015-07-05 18:13 GMT+02:00 Muhammad Sharfuddin m.sharfud...@nds.com.pk:

SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70, openais-1.1.4-5.22.1.7). It is a dual-primary drbd cluster which mounts a file system resource on both cluster nodes simultaneously (file system type is ocfs2).

Whenever one of the nodes goes down, the file system (/sharedata) becomes inaccessible for exactly 35 seconds on the other (surviving/online) node, and then becomes available again on the online node. Please help me understand why the node which survives or remains online cannot access the file system resource (/sharedata) for 35 seconds, and how I can fix the cluster so that the file system remains accessible on the surviving node without any interruption/delay (about 35 seconds in my case).

By inaccessible I mean that running "ls -l /sharedata" and "df /sharedata" returns no output and does not give the prompt back on the online node for exactly 35 seconds once the other node becomes offline. E.g. node1 went offline somewhere around 01:37:15, and /sharedata was then inaccessible between 01:37:35 and 01:38:18 on the online node, i.e. node2.

/var/log/messages on node2, when node1 went offline:

Jul 5 01:37:26 node2 kernel: [ 675.255865] drbd r0: PingAck did not arrive in time.
Jul 5 01:37:26 node2 kernel: [ 675.255886] drbd r0: peer( Primary - Unknown ) conn( Connected - NetworkFailure ) pdsk( UpToDate - DUnknown )
Jul 5 01:37:26 node2 kernel: [ 675.256030] block drbd0: new current UUID C23D1458962AD18D:A8DD404C9F563391:6A5F4A26F64BAF0B:6A5E4A26F64BAF0B
Jul 5 01:37:26 node2 kernel: [ 675.256079] drbd r0: asender terminated
Jul 5 01:37:26 node2 kernel: [ 675.256081] drbd r0: Terminating drbd_a_r0
Jul 5 01:37:26 node2 kernel: [ 675.256306] drbd r0: Connection closed
Jul 5 01:37:26 node2 kernel: [ 675.256338] drbd r0: conn( NetworkFailure - Unconnected )
Jul 5 01:37:26 node2 kernel: [ 675.256339] drbd r0: receiver terminated
Jul 5 01:37:26 node2 kernel: [ 675.256340] drbd r0: Restarting receiver thread
Jul 5 01:37:26 node2 kernel: [ 675.256341] drbd r0: receiver (re)started
Jul 5 01:37:26 node2 kernel: [ 675.256344] drbd r0: conn( Unconnected - WFConnection )
Jul 5 01:37:29 node2 corosync[4040]: [TOTEM ] A processor failed, forming new configuration.
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] CLM CONFIGURATION CHANGE
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] New Configuration:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.132)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Left:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.131)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Joined:
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 216: memb=1, new=0, lost=1
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] info: pcmk_peer_update: memb: node2 739307908
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] info: pcmk_peer_update: lost: node1 739307907
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] CLM CONFIGURATION CHANGE
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] New Configuration:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.132)
Jul 5 01:37:35 node2 cluster-dlm[4344]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 ocfs2_controld[4473]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Left:
Jul 5 01:37:35 node2 crmd[4050]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 stonith-ng[4046]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 cib[4045]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 cluster-dlm[4344]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 ocfs2_controld[4473]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Joined:
Jul 5 01:37:35 node2 crmd[4050]: warning: match_down_event: No match for shutdown action on node1
Jul 5 01:37:35 node2 stonith-ng[4046]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was
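For reference, the roughly 35 seconds of blocking described above is approximately the corosync failure-detection time plus the sbd msgwait, after which DLM/OCFS2 recovery can proceed. A quick way to look at both values; the sbd device path below is a placeholder, and the config path is the usual one on SLES 11 HAE:

    # how long corosync waits before declaring the peer failed
    grep -i token /etc/corosync/corosync.conf
    # how long sbd then needs before the peer counts as fenced
    sbd -d /dev/<your-sbd-device> dump        # look at "Timeout (msgwait)"

If you shorten msgwait, keep pacemaker's stonith-timeout comfortably larger than it, otherwise fencing will be declared failed before the poison pill can take effect.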
Re: [Linux-HA] pacemaker/heartbeat LVM
Please use pastebin and show your whole logs.

2014-12-29 9:06 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

By the way, just to note that for normal testing (manual failover, rebooting the active node) the cluster works fine. I only encounter this error when I try to power off / shut off the active node.

On Mon, Dec 29, 2014 at 4:05 PM, Marlon Guao marlon.g...@gmail.com wrote:

Hi.

Dec 29 13:47:16 s1 LVM(vg1)[1601]: WARNING: LVM Volume cluvg1 is not available (stopped)
Dec 29 13:47:16 s1 crmd[1515]: notice: process_lrm_event: Operation vg1_monitor_0: not running (node=s1, call=23, rc=7, cib-update=40, confirmed=true)
Dec 29 13:47:16 s1 crmd[1515]: notice: te_rsc_command: Initiating action 9: monitor fs1_monitor_0 on s1 (local)
Dec 29 13:47:16 s1 crmd[1515]: notice: te_rsc_command: Initiating action 16: monitor vg1_monitor_0 on s2
Dec 29 13:47:16 s1 Filesystem(fs1)[1618]: WARNING: Couldn't find device [/dev/mapper/cluvg1-clulv1]. Expected /dev/??? to exist

From the LVM agent: it checks whether the volume is already available and raises the above error if not. But I don't see that it tries to activate the VG before raising the error. Perhaps it assumes that the VG is already activated, so I'm not sure who should be activating it (should it be LVM?).

if [ $rc -ne 0 ]; then
    ocf_log $loglevel "LVM Volume $1 is not available (stopped)"
    rc=$OCF_NOT_RUNNING
else
    case $(get_vg_mode) in
    1) # exclusive with tagging.
        # If vg is running, make sure the correct tag is present. Otherwise we
        # can not guarantee exclusive activation.
        if ! check_tags; then
            ocf_exit_reason "WARNING: $OCF_RESKEY_volgrpname is active without the cluster tag, \"$OUR_TAG\""

On Mon, Dec 29, 2014 at 3:36 PM, emmanuel segura emi2f...@gmail.com wrote:

logs?

2014-12-29 6:54 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Hi, just want to ask about the LVM resource agent on pacemaker/corosync. I set up a 2-node cluster (opensuse 13.2 -- my config below). The cluster works as expected for a manual failover (via crm resource move) and for automatic failover (by rebooting the active node, for instance). But if I just shut off the active node (it's a VM, so I can do a poweroff), the resources won't fail over to the passive node. When I investigated, it was due to the LVM resource not starting (specifically, the VG). I found out that the LVM resource won't try to activate the volume group on the passive node. Is this expected behaviour? What I really expect is that, in the event the active node is shut off (by a power outage for instance), all resources fail over automatically to the passive node and LVM re-activates the VG. Here's my config.
node 1: s1
node 2: s2
primitive cluIP IPaddr2 \
    params ip=192.168.13.200 cidr_netmask=32 \
    op monitor interval=30s
primitive clvm ocf:lvm2:clvmd \
    params daemon_timeout=30 \
    op monitor timeout=90 interval=30
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60s timeout=90s on-fail=ignore \
    op start interval=0 timeout=90
primitive fs1 Filesystem \
    params device=/dev/mapper/cluvg1-clulv1 directory=/data fstype=btrfs
primitive mariadb mysql \
    params config=/etc/my.cnf
primitive sbd stonith:external/sbd \
    op monitor interval=15s timeout=60s
primitive vg1 LVM \
    params volgrpname=cluvg1 exclusive=yes \
    op start timeout=10s interval=0 \
    op stop interval=0 timeout=10 \
    op monitor interval=10 timeout=30 on-fail=restart depth=0
group base-group dlm clvm
group rgroup cluIP vg1 fs1 mariadb \
    meta target-role=Started
clone base-clone base-group \
    meta interleave=true target-role=Started
property cib-bootstrap-options: \
    dc-version=1.1.12-1.1.12.git20140904.266d5c2 \
    cluster-infrastructure=corosync \
    no-quorum-policy=ignore \
    last-lrm-refresh=1419514875 \
    cluster-name=xxx \
    stonith-enabled=true
rsc_defaults rsc-options: \
    resource-stickiness=100
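When testing this by hand on the surviving node, it can help to run the same kind of checks the LVM agent performs; a minimal sketch, using the cluvg1 VG name from the config above:

    # is the VG visible and active on this node?
    vgs -o vg_name,vg_attr cluvg1
    lvs cluvg1
    # try to activate it exclusively, the way exclusive=yes expects
    vgchange -a ey cluvg1

If vgchange hangs here after the peer was powered off, the problem is below LVM (clvmd/DLM waiting for fencing), not in the resource agent itself.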
Re: [Linux-HA] pacemaker/heartbeat LVM
Sorry, but your paste is empty.

2014-12-29 10:19 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

hi, uploaded it here. http://susepaste.org/45413433 thanks.

On Mon, Dec 29, 2014 at 5:09 PM, Marlon Guao marlon.g...@gmail.com wrote:

Ok, I attached the log file of one of the nodes.

On Mon, Dec 29, 2014 at 4:42 PM, emmanuel segura emi2f...@gmail.com wrote:

Please use pastebin and show your whole logs.
Re: [Linux-HA] pacemaker/heartbeat LVM
Hi, you have a problem in the cluster:

stonithd: error: crm_abort: crm_glib_handler: Forked child 6186 to record non-fatal assert at logging.c:73

Please post your cluster version (packages); maybe someone can tell you whether this is a known bug or a new one.

2014-12-29 10:29 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

ok, sorry for that.. please use this instead. http://pastebin.centos.org/14771/ thanks.

On Mon, Dec 29, 2014 at 5:25 PM, emmanuel segura emi2f...@gmail.com wrote:

Sorry, but your paste is empty.
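A quick way to collect the package versions asked for above (the package names are the usual ones on openSUSE; adjust the list for your distribution):

    rpm -qa | egrep 'pacemaker|corosync|resource-agents|libqb|sbd|cluster-glue'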
Re: [Linux-HA] pacemaker/heartbeat LVM
Dec 27 15:38:00 s1 cib[1514]: error: crm_xml_err: XML Error: Permission denied  Permission denied  I/O warning : failed to load external entity /var/lib/pacemaker/cib/cib.xml
Dec 27 15:38:00 s1 cib[1514]: error: write_cib_contents: Cannot link /var/lib/pacemaker/cib/cib.xml to /var/lib/pacemaker/cib/cib-0.raw: Operation not permitted (1)

2014-12-29 10:33 GMT+01:00 emmanuel segura emi2f...@gmail.com:

Hi, you have a problem in the cluster:

stonithd: error: crm_abort: crm_glib_handler: Forked child 6186 to record non-fatal assert at logging.c:73

Please post your cluster version (packages); maybe someone can tell you whether this is a known bug or a new one.
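The "Operation not permitted" error while writing cib.xml usually points at wrong ownership or permissions under /var/lib/pacemaker, since the cib daemon runs as the hacluster user. A hedged sketch of checking and fixing that; the hacluster:haclient owner and the path are the pacemaker defaults, verify them on your installation:

    ls -l /var/lib/pacemaker/cib/
    # expected owner is hacluster:haclient; if not:
    chown -R hacluster:haclient /var/lib/pacemaker/cib
    chmod 750 /var/lib/pacemaker/cib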
Re: [Linux-HA] pacemaker/heartbeat LVM
DLM isn't the problem; I think it is your fencing. When you powered off the active node, did the dead node remain in the unclean state? Can you show me your sbd timeouts?

sbd -d /dev/path_of_your_device dump

Thanks

2014-12-29 11:02 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Hi, ah yeah.. I powered off the active node and tried pvscan on the passive, and indeed it didn't work --- it doesn't return to the shell. So the problem is in DLM?

On Mon, Dec 29, 2014 at 5:51 PM, emmanuel segura emi2f...@gmail.com wrote:

Power off the active node and after one second try to use an lvm command, for example pvscan. If this command doesn't respond it is because dlm relies on cluster fencing; if cluster fencing doesn't work, dlm stays blocked.

2014-12-29 10:43 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Perhaps we need to focus on this message. As mentioned, the cluster works fine under normal circumstances. My only concern is that the LVM resource agent doesn't try to re-activate the VG on the passive node when the active node goes down ungracefully (powered off). Hence it could not mount the filesystems, etc.

Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation sbd_monitor_0: not running (node=s1, call=5, rc=7, cib-update=35, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 13: monitor dlm:0_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 5: monitor dlm:1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation dlm_monitor_0: not running (node=s1, call=10, rc=7, cib-update=36, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 14: monitor clvm:0_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 6: monitor clvm:1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation clvm_monitor_0: not running (node=s1, call=15, rc=7, cib-update=37, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 15: monitor cluIP_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 7: monitor cluIP_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation cluIP_monitor_0: not running (node=s1, call=19, rc=7, cib-update=38, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 16: monitor vg1_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 8: monitor vg1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 LVM(vg1)[1583]: WARNING: LVM Volume cluvg1 is not available (stopped)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation vg1_monitor_0: not running (node=s1, call=23, rc=7, cib-update=39, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 17: monitor fs1_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 9: monitor fs1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 Filesystem(fs1)[1600]: WARNING: Couldn't find device [/dev/mapper/cluvg1-clulv1]. Expected /dev/??? to exist
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation fs1_monitor_0: not running (node=s1, call=27, rc=7, cib-update=40, confirmed=true)

On Mon, Dec 29, 2014 at 5:38 PM, emmanuel segura emi2f...@gmail.com wrote:

Dec 27 15:38:00 s1 cib[1514]: error: crm_xml_err: XML Error: Permission denied  Permission denied  I/O warning : failed to load external entity /var/lib/pacemaker/cib/cib.xml
Dec 27 15:38:00 s1 cib[1514]: error: write_cib_contents: Cannot link /var/lib/pacemaker/cib/cib.xml to /var/lib/pacemaker/cib/cib-0.raw: Operation not permitted (1)
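The test described above, spelled out as a small procedure to run on the surviving node right after powering off its peer (resource names come from this cluster; the exact dlm_tool output format depends on your version):

    pvscan            # hangs while DLM is waiting for fencing
    crm_mon -1        # does the dead node stay UNCLEAN (offline)?
    dlm_tool ls       # lockspace status; look for pending fencing/recovery

If pvscan only returns after the dead node has actually been fenced (or never returns when fencing fails), the blockage is in the fencing path, not in the LVM agent.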
Re: [Linux-HA] pacemaker/heartbeat LVM
https://bugzilla.redhat.com/show_bug.cgi?id=1127289#c4
https://bugzilla.redhat.com/show_bug.cgi?id=1127289

2014-12-29 11:57 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

here it is..

==Dumping header on disk /dev/mapper/sbd
Header version : 2.1
UUID : 36074673-f48e-4da2-b4ee-385e83e6abcc
Number of slots: 255
Sector size: 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 10

On Mon, Dec 29, 2014 at 6:42 PM, emmanuel segura emi2f...@gmail.com wrote:

DLM isn't the problem; I think it is your fencing. When you powered off the active node, did the dead node remain in the unclean state? Can you show me your sbd timeouts?

sbd -d /dev/path_of_your_device dump

Thanks
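Given the msgwait of 10 seconds shown in that dump, the cluster's stonith-timeout has to be noticeably larger than msgwait, or sbd fencing will be reported as failed before the poison pill can act. A sketch only; the 40s value is just "msgwait plus a generous margin", not a recommendation specific to this setup:

    sbd -d /dev/mapper/sbd dump                   # Timeout (msgwait) : 10
    crm configure property stonith-timeout=40s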
Re: [Linux-HA] pacemaker/heartbeat LVM
You have no-quorum-policy=ignore. In the thread you posted:

Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 datastores wait for fencing
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 clvmd wait for fencing
Nov 24 09:55:10 nebula3 dlm_controld[6263]: 747 fence status 1084811078 receive -125 from 1084811079 walltime 1416819310 local 747

The dependency chain is {lvm}-{clvmd}-{dlm}-{fencing} = if fencing isn't working :) your cluster will be broken.

2014-12-29 15:46 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Looks like it's similar to this as well: http://comments.gmane.org/gmane.linux.highavailability.pacemaker/22398
But could it be that clvm is not activating the vg on the passive node because it's waiting for quorum? I'm seeing this in the log as well:

Dec 29 21:18:09 s2 dlm_controld[1776]: 8544 fence work wait for quorum
Dec 29 21:18:12 s2 dlm_controld[1776]: 8547 clvmd wait for quorum

On Mon, Dec 29, 2014 at 9:24 PM, Marlon Guao marlon.g...@gmail.com wrote:

interesting, i'm using the newer pacemaker version.. pacemaker-1.1.12.git20140904.266d5c2-1.5.x86_64

On Mon, Dec 29, 2014 at 8:11 PM, emmanuel segura emi2f...@gmail.com wrote:

https://bugzilla.redhat.com/show_bug.cgi?id=1127289#c4
https://bugzilla.redhat.com/show_bug.cgi?id=1127289
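Since this is a two-node corosync 2.x cluster (openSUSE 13.2), the usual way to stop DLM and clvmd from waiting for quorum after losing the peer is to declare the cluster two-node in corosync itself, rather than relying only on no-quorum-policy=ignore in pacemaker. A sketch of the corosync.conf quorum section, assuming votequorum is in use (two_node implicitly enables wait_for_all):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }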
Re: [Linux-HA] pacemaker/heartbeat LVM
logs?

2014-12-29 6:54 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Hi, just want to ask about the LVM resource agent on pacemaker/corosync. I set up a 2-node cluster (opensuse 13.2 -- my config below). The cluster works as expected for a manual failover (via crm resource move) and for automatic failover (by rebooting the active node, for instance). But if I just shut off the active node (it's a VM, so I can do a poweroff), the resources won't fail over to the passive node. When I investigated, it was due to the LVM resource not starting (specifically, the VG). I found out that the LVM resource won't try to activate the volume group on the passive node. Is this expected behaviour? What I really expect is that, in the event the active node is shut off (by a power outage for instance), all resources fail over automatically to the passive node and LVM re-activates the VG. Here's my config.

node 1: s1
node 2: s2
primitive cluIP IPaddr2 \
    params ip=192.168.13.200 cidr_netmask=32 \
    op monitor interval=30s
primitive clvm ocf:lvm2:clvmd \
    params daemon_timeout=30 \
    op monitor timeout=90 interval=30
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60s timeout=90s on-fail=ignore \
    op start interval=0 timeout=90
primitive fs1 Filesystem \
    params device=/dev/mapper/cluvg1-clulv1 directory=/data fstype=btrfs
primitive mariadb mysql \
    params config=/etc/my.cnf
primitive sbd stonith:external/sbd \
    op monitor interval=15s timeout=60s
primitive vg1 LVM \
    params volgrpname=cluvg1 exclusive=yes \
    op start timeout=10s interval=0 \
    op stop interval=0 timeout=10 \
    op monitor interval=10 timeout=30 on-fail=restart depth=0
group base-group dlm clvm
group rgroup cluIP vg1 fs1 mariadb \
    meta target-role=Started
clone base-clone base-group \
    meta interleave=true target-role=Started
property cib-bootstrap-options: \
    dc-version=1.1.12-1.1.12.git20140904.266d5c2 \
    cluster-infrastructure=corosync \
    no-quorum-policy=ignore \
    last-lrm-refresh=1419514875 \
    cluster-name=xxx \
    stonith-enabled=true
rsc_defaults rsc-options: \
    resource-stickiness=100
Re: [Linux-HA] Oracle OCF Script throws SP2-0640: Not connected
I'm using resource-agents-3.9.2-0.25.5 on SUSE 11 SP2 and I don't have any ". /usr/lib/ocf/lib/heartbeat/ora-common.sh" in my agent. Maybe you need to create a new database user; your trace shows the agent connecting as OCFMON:

++ local 'conn_s=connect OCFMON/OCFMON'
++ shift 1
++ local func
++ echo 'connect OCFMON/OCFMON'

2014-08-01 18:40 GMT+02:00 Wendt Christian christian.we...@bosch-si.com:

Hello *, I did a lot of research but I'm not able to figure out why our Oracle resource started failing on Wednesday. Starting Oracle fails with the message:

/usr/lib/ocf/resource.d/heartbeat/oracle start
INFO: orcSNBGW instance state is not OPEN (dbstat output: SP2-0640: Not connected)
ERROR: oracle instance orcSNBGW not started

showdbstat gives:

/usr/lib/ocf/resource.d/heartbeat/oracle showdbstat
Full output: SP2-0640: Not connected
Stripped output: OPEN

So the first method showdbstat uses to monitor the DB fails, but the second one succeeds. It is no longer possible to start Oracle within the pacemaker cluster; every time we start it, it fails. I've attached the bash output from starting Oracle with the OCF script. Database and OS are fine; nothing changed in the last days. Do you have any ideas? Thank you in advance.

Mit freundlichen Grüßen / Best regards
Christian Wendt
Bosch Software Innovations GmbH
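The trace above shows the agent connecting as OCFMON/OCFMON, which is the monitoring user newer versions of the oracle resource agent expect. A hedged sketch of creating it by hand as SYSDBA; the exact privileges your agent version needs may differ, so treat the grants below as illustrative only:

    sqlplus / as sysdba <<'EOF'
    create user OCFMON identified by OCFMON;
    grant create session to OCFMON;
    grant select on v_$instance to OCFMON;
    EOF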
Re: [Linux-HA] Antw: Managed Failovers w/ NFS HA Cluster
But the NFS failover works now?

2014-07-22 2:10 GMT+02:00 Charles Taylor chas...@ufl.edu:

On Jul 21, 2014, at 10:40 AM, Charles Taylor wrote:

As I write this, I'm thinking that perhaps the way to achieve this is to change the order of the services so that the VIP is started last and stopped first when stopping/starting the resource group. That should make it appear to the client that the server just went away, as would happen in a failure scenario. Then the client should not know that the file system has been unexported, since it can't talk to the server. Perhaps I just made a rookie mistake in the ordering of the services within the resource group. I'll try that and report back.

Yep, this was my mistake. The IPaddr2 primitive needs to follow the exportfs primitives in my resource group, so they are now arranged as:

Resource Group: grp_b3v0
    vg_b3v0   (ocf::heartbeat:LVM)         Started
    fs_b3v0   (ocf::heartbeat:Filesystem)  Started
    ex_b3v0_1 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_2 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_3 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_4 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_5 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_6 (ocf::heartbeat:exportfs)    Started
    ip_vbio3  (ocf::heartbeat:IPaddr2)     Started

Thanks to those who responded,
Charlie Taylor
UF Research Computing
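For reference, the corrected ordering expressed as a crm shell group definition; the resource names are the ones from the status output above, and this is only a sketch of "exports before VIP":

    group grp_b3v0 vg_b3v0 fs_b3v0 \
        ex_b3v0_1 ex_b3v0_2 ex_b3v0_3 ex_b3v0_4 ex_b3v0_5 ex_b3v0_6 \
        ip_vbio3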
Re: [Linux-HA] DRBD on CentOS7
depmod -a
modprobe drbd
?

2014-07-18 13:05 GMT+02:00 willi.feh...@t-online.de willi.feh...@t-online.de:

Hello, I'm trying to use DRBD on CentOS 7. It looks like Red Hat hasn't compiled DRBD into the kernel, so I downloaded the source rpm from Fedora 19 and built my own rpms:

[root@centos7 ~]# rpm -qa | grep drbd
drbd-utils-8.4.3-2.el7.centos.x86_64
drbd-8.4.3-2.el7.centos.x86_64
drbd-udev-8.4.3-2.el7.centos.x86_64

But I cannot load the drbd kernel module:

[root@centos7 ~]# modprobe drbd
modprobe: FATAL: Module drbd not found.

Regards - Willi
Re: [Linux-HA] DRBD on CentOS7
rpm -ql drbd-8.4.3-2.el7.centos.x86_64

2014-07-18 16:31 GMT+02:00 Alessandro Baggi alessandro.ba...@gmail.com:

I'm new to CentOS, and even newer to CentOS 7. Maybe you have not compiled the drbd kernel module. Reading the drbd site, you must prepare the kernel source tree and pass --with-km to also build the kernel module. I'm running CentOS 6.5, and for the drbd suite I use elrepo (it supports el7 too) and it works very well. http://elrepo.org/tiki/tiki-index.php

depmod -a
modprobe drbd
?
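A sketch of the elrepo route mentioned above; the release RPM URL and package names are the ones elrepo published for el7, so verify the current versions before installing:

    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
    yum install drbd84-utils kmod-drbd84
    depmod -a && modprobe drbd && lsmod | grep drbd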
Re: [Linux-HA] troublesome DRBD resources on CentOS 6.5
no logs!

2014-06-05 14:56 GMT+02:00 Bart Coninckx bart.conin...@telenet.be:

Hi all, I have some DRBD resources on CentOS 6.5 which refuse to start. The message I get in Hawk and in /var/log/messages is:

Failed op: node=storage3, resource=p_drbd_ws021, call-id=73, operation=monitor, rc-code=6

I am able to start the DRBD resources manually. I figured out that code 6 means configuration error, but I don't see where. DRBD and its resource agent are installed from source (drbd-8.4.4). This is the relevant cluster configuration, which I took from the Clusters from Scratch document:

primitive p_drbd_ws021 ocf:linbit:drbd \
    params drbd_resource=ws021 drbdconf=/etc/drbd.conf \
    op monitor interval=60 timeout=20
ms ms_drbd_ws021 p_drbd_ws021 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started

Any tips or hints are most welcome, because I have been looking at this for two days with no progress. Thanks! BC
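rc-code=6 is the OCF "not configured" return code, i.e. the agent itself rejected its parameters. One way to see why, outside pacemaker, is to drive the agent by hand with ocf-tester from resource-agents; the resource name and parameters are taken from the configuration above:

    ocf-tester -n p_drbd_ws021 \
        -o drbd_resource=ws021 -o drbdconf=/etc/drbd.conf \
        /usr/lib/ocf/resource.d/linbit/drbd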
Re: [Linux-HA] Packemaker resources for Galera cluster
If you have the ClusterIP resource in g_mysql, I think you don't need "order order_mysql_before_ip Mandatory: p_mysql ClusterIP", because a group is ordered by default. If you want mysql running on all boxes, use a clone resource and a colocation constraint to put the IP on a box with an active mysql instance.

2014-06-05 6:48 GMT+02:00 Razvan Oncioiu ronci...@gmail.com:

Hello, I can't seem to find a proper way of setting up resources in pacemaker to manage my Galera cluster. I want a VIP that will fail over between 5 boxes (this works), but I would also like to tie this to a resource that monitors mysql as well: if a mysql instance goes down, the VIP should move to another box that has mysql actually running. But I do not want pacemaker to start or stop the mysql service. Here is my current configuration:

node galera01
node galera02
node galera03
node galera04
node galera05
primitive ClusterIP IPaddr2 \
    params ip=10.10.10.178 cidr_netmask=24 \
    meta is-managed=true \
    op monitor interval=5s
primitive p_mysql mysql \
    params pid=/var/lib/mysql/mysqld.pid test_user=root test_passwd=goingforbroke \
    meta is-managed=false \
    op monitor interval=5s OCF_CHECK_LEVEL=10 \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s on-fail=standby
group g_mysql p_mysql ClusterIP
order order_mysql_before_ip Mandatory: p_mysql ClusterIP
property cib-bootstrap-options: \
    dc-version=1.1.10-14.el6_5.3-368c726 \
    cluster-infrastructure="classic openais (with plugin)" \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    expected-quorum-votes=5 \
    last-lrm-refresh=1401942846
rsc_defaults rsc-options: \
    resource-stickiness=100
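A sketch of the clone-plus-colocation layout suggested above, in crm shell; p_mysql and ClusterIP are the original resources, while the clone and constraint names are made up here:

    clone cl_mysql p_mysql \
        meta interleave=true
    colocation col_ip_with_mysql inf: ClusterIP cl_mysql
    order ord_mysql_before_ip Mandatory: cl_mysql ClusterIP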
Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK
The first thing: you are using no_path_retry the wrong way in your multipath configuration. Try reading this: http://www.novell.com/documentation/oes2/clus_admin_lx/data/bl9ykz6.html

2014-04-22 20:41 GMT+02:00 Tom Parker tpar...@cbnco.com:

I have attached the config files to this e-mail. The sbd dump is below.

[LIVE] qaxen1:~ # sbd -d /dev/mapper/qa-xen-sbd dump
==Dumping header on disk /dev/mapper/qa-xen-sbd
Header version : 2.1
UUID : ae835596-3d26-4681-ba40-206b4d51149b
Number of slots: 255
Sector size: 512
Timeout (watchdog) : 45
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 90
==Header on disk /dev/mapper/qa-xen-sbd is dumped

On 22/04/14 02:30 PM, emmanuel segura wrote:

You are missing the cluster configuration, the sbd configuration and the multipath config.

2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com:

Has anyone seen this? Do you know what might be causing the flapping?
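What the linked document boils down to for an sbd device: I/O to it must fail fast instead of being queued when paths drop, otherwise sbd's disk servant stalls and the health check flaps exactly as in the log below. A hedged multipath.conf sketch; the wwid is a placeholder and your array vendor's recommended settings take precedence:

    multipaths {
        multipath {
            wwid          <wwid-of-qa-xen-sbd>
            alias         qa-xen-sbd
            no_path_retry fail
            features      "0"
        }
    }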
Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK
you are missingo cluster configuration and sbd configuration and multipath config 2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com: Has anyone seen this? Do you know what might be causing the flapping? Apr 21 22:03:03 qaxen6 sbd: [12962]: info: Watchdog enabled. Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Servant starting for device /dev/mapper/qa-xen-sbd Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Monitoring Pacemaker health Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Device /dev/mapper/qa-xen-sbd uuid: ae835596-3d26-4681-ba40-206b4d51149b Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Legacy plug-in detected, AIS quorum check enabled Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Waiting to sign in with cluster ... Apr 21 22:03:04 qaxen6 sbd: [12971]: notice: Using watchdog device: /dev/watchdog Apr 21 22:03:04 qaxen6 sbd: [12971]: info: Set watchdog timeout to 45 seconds. Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with cluster ... Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right now. Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:03:09 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 21 22:15:01 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:16:37 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 21 22:25:08 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 21 22:26:44 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:26:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 21 22:39:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 21 22:39:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 21 22:42:44 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:42:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 01:36:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 01:36:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 01:36:34 qaxen6 sbd: [12974]: info: Node state: online Apr 22 01:36:34 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 06:53:15 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 06:53:15 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 06:54:03 qaxen6 sbd: [12974]: info: Node state: online Apr 22 06:54:03 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 09:57:21 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 09:57:21 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 09:58:12 qaxen6 sbd: [12974]: info: Node state: online Apr 22 09:58:12 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 10:59:49 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 10:59:49 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 11:00:41 qaxen6 sbd: [12974]: info: Node state: online Apr 22 11:00:41 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 11:50:55 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! 
Apr 22 11:50:55 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 11:51:06 qaxen6 sbd: [12974]: info: Node state: online Apr 22 11:51:06 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:09:12 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:09:12 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:09:35 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:09:35 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:31:35 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:31:35 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:31:44 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:31:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:32:52 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:32:52 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:33:01 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:33:01 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:44:39 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:44:39 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:44:47 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:44:47 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 14:07:42 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 14:07:42 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 14:07:51 qaxen6 sbd: [12974]: info: Node state: online Apr 22 14:07:51 qaxen6 sbd: [12971]: info: Pacemaker
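To gather what the reply asks for, and to compare the on-disk SBD timeouts against the corosync token timeout, commands along these lines could be run on qaxen6 (the device path comes from the log above; the file locations assume a SLES-style install):

    sbd -d /dev/mapper/qa-xen-sbd dump            # on-disk watchdog/msgwait timeouts
    cat /etc/sysconfig/sbd                        # how the sbd daemon is started
    grep -A20 totem /etc/corosync/corosync.conf   # token/consensus settings
    multipath -ll qa-xen-sbd                      # path state of the SBD LUN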
Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA
The idea behind using exclusive volume activation mode with clvmd was (I think) to have the VG known cluster-wide while its LVs are opened on only one node, with the LVM metadata replicated to all cluster nodes whenever you make a change such as an LVM resize. I have a Red Hat cluster running clvmd with a VG activated in exclusive mode: if you add a PV to your volume group, every cluster node knows about the new PV in the VG, but you still cannot activate the VG if it is active on another node. I think clvmd is needed just to replicate the LVM metadata. 2013/12/2 Ulrich Windl ulrich.wi...@rz.uni-regensburg.de Lars Marowsky-Bree l...@suse.com wrote on 29.11.2013 at 13:48 in message 20131129124833.gf22...@suse.de: On 2013-11-29T13:46:17, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: I just did s/true/false/... Was that a clustered volume? Clustered exclusive=true ?? No! Then it can't work. Exclusive activation only works for clustered volume groups, since it uses the DLM to protect against the VG being activated more than once in the cluster. Hi! Try it with resource-agents-3.9.4-0.26.84: it works; with resource-agents-3.9.5-0.6.26.11 it doesn't work ;-) You could argue that it never should have worked. Anyway: If you want to activate a VG on exactly one node you should not need cLVM; only if you mean to activate the VG on multiple nodes (as for a cluster file system)... Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
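As a rough illustration of what exclusive activation of a clustered VG involves (the VG name vg_shared is invented), the clustered flag is set once and activation is then arbitrated through clvmd and the DLM:

    vgchange -c y vg_shared     # mark the VG as clustered (clvmd must be running)
    vgchange -a ey vg_shared    # exclusive activation; refused if active elsewhere

    primitive p_lvm ocf:heartbeat:LVM \
        params volgrp="vg_shared" exclusive="true" \
        op monitor interval="60s"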
Re: [Linux-HA] trouble installing Heartbeat
maybe you are missing the uuid library 2013/12/1 John Williams john.1...@yahoo.com I'm trying to install heartbeat and I'm getting the following error with the cluster glue components during the make part of the build: /bin/sh ../../libtool --tag=CC --tag=CC --mode=link gcc -std=gnu99 -g -O2 -ggdb3 -O0 -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-qual -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wmissing-format-attribute -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror -o ipctest ipctest.o libplumb.la ../../replace/libreplace.la ../../lib/pils/ libpils.la -lbz2 -lxml2 -lc -lrt -ldl -lglib-2.0 -lltdl libtool: link: gcc -std=gnu99 -g -O2 -ggdb3 -O0 -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-qual -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wmissing-format-attribute -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror -o .libs/ipctest ipctest.o ./.libs/libplumb.so /home/ssaleh/Reusable-Cluster-Components-glue--glue-1.0.9/lib/pils/.libs/libpils.so ../../replace/.libs/libreplace.a ../../lib/pils/.libs/libpils.so -lbz2 -lxml2 -lc -lrt -ldl -lglib-2.0 -lltdl ./.libs/libplumb.so: undefined reference to `uuid_parse' ./.libs/libplumb.so: undefined reference to `uuid_generate' ./.libs/libplumb.so: undefined reference to `uuid_copy' ./.libs/libplumb.so: undefined reference to `uuid_is_null' ./.libs/libplumb.so: undefined reference to `uuid_unparse' ./.libs/libplumb.so: undefined reference to `uuid_clear' ./.libs/libplumb.so: undefined reference to `uuid_compare' collect2: ld returned 1 exit status gmake[2]: *** [ipctest] Error 1 gmake[2]: Leaving directory `/home/john/Reusable-Cluster-Components-glue--glue-1.0.9/lib/clplumbing' gmake[1]: *** [all-recursive] Error 1 gmake[1]: Leaving directory `/home/john/Reusable-Cluster-Components-glue--glue-1.0.9/lib' make: *** [all-recursive] Error 1 [root@bigb1 Reusable-Cluster-Components-glue--glue-1.0.9]# How do I resolve this? Thanks in advance. J. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
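If the missing piece really is the uuid development library, installing it and rebuilding would look roughly like this (package names differ per distribution):

    # Debian/Ubuntu
    apt-get install uuid-dev
    # RHEL/CentOS/Fedora
    yum install libuuid-devel
    # then rebuild cluster-glue from the top of the source tree
    ./configure && make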
Re: [Linux-HA] cman-controlled cluster takes an hour to start !?
Put your cluster node hostnames in /etc/hosts, and I think you are missing <cman two_node="1" expected_votes="1"/> in cluster.conf 2013/8/23 Jakob Curdes j...@info-systems.de Hmmm, the problem turns out to be DNS-related. At startup, some of the virtual interfaces are inactive and the DNS servers are unreachable. And CMAN seems to do a lookup for all IP addresses on the machine; I have the names of all cluster members in the hosts file but not all names of all other addresses (i.e. the ones managed by the cluster). Anyway I wonder why even with -d64 it doesn't tell me anything about what it is doing. I think the timespan of an hour is just because we have lots of VLAN interfaces that it wants to get a DNS name for. Regards, Jakob Curdes ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
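A sketch of the two suggestions above; the host names and addresses are placeholders:

    <!-- /etc/cluster/cluster.conf: two-node mode, no quorum wait -->
    <cman two_node="1" expected_votes="1"/>

    # /etc/hosts: resolve cluster members and cluster-managed addresses locally
    192.168.1.10   node1.example.com   node1
    192.168.1.11   node2.example.com   node2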
Re: [Linux-HA] Replacement of heartbeat on fc18 ?
yum install corosync pacemaker 2013/8/23 Francis SOUYRI francis.sou...@apec.fr Hi, Thank you but I do not find on yum OpenAIS or an rpm for OpenAIS for fc18, do you know where I can search. Best regards. Francis On 08/23/2013 03:58 PM, Nick Cameo wrote: Pacemaker+Corosync/OpenAIS http://clusterlabs.org/ N. __**_ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/**mailman/listinfo/linux-hahttp://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/**ReportingProblemshttp://linux-ha.org/ReportingProblems __**_ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/**mailman/listinfo/linux-hahttp://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/**ReportingProblemshttp://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] resource won't run on a specific node
Hello Can you show us crm configure show? thanks 2013/7/27 Miles Fidelman mfidel...@meetinghouse.net Hi Folks, Dual-node, pacemaker cluster, DRBD-backed xen virtual machines - one of our VMs will run on one node, but not the other, and crm status yields a failure message saying that starting the resource failed for unknown reasons. The log is only slightly less useless: (server2 and server3 are the nodes, server1 is the resource) server3, running server1, crashes node entries from server2 trying to failover the resource Jul 27 06:27:06 server2 pengine: [1365]: info: get_failcount: server1 has failed INFINITY times on server2 Jul 27 06:27:06 server2 pengine: [1365]: WARN: common_apply_stickiness: Forcing server1 away from server2 after 100 failures (max=100) Jul 27 06:27:06 server2 pengine: [1365]: info: native_color: Resource server1 cannot run anywhere Jul 27 06:27:06 server2 pengine: [1365]: notice: LogActions: Leave resource server1#011(Stopped) Attempts to migrate the server fail with the same errors. Failover USED to work just fine. It still works for other VMs. Any idea how to track down what's failing? Thanks very much, Miles Fidelman -- In theory, there is no difference between theory and practice. In practice, there is. Yogi Berra __**_ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/**mailman/listinfo/linux-hahttp://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/**ReportingProblemshttp://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
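Besides posting crm configure show, the INFINITY failcount visible in the log has to be cleared before server1 is allowed back onto server2; with crmsh that is roughly:

    crm resource failcount server1 show server2   # inspect the counter
    crm resource cleanup server1                  # clear failed operations and failcount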
Re: [Linux-HA] Antw: Re: Pacemaker - Resource dont get started on the standby node.
Hello Parkirat Thank you very much 2013/6/17 Parkirat parkiratba...@gmail.com Thanks Ulrich, I have figured out the problem. The actual problem was in the configuration file for the resource httpd. It was correct in the Master node but the configuration was missing in the standby node, which was not allowing it to start. Regards, Parkirat Singh Bagga. -- View this message in context: http://linux-ha.996297.n3.nabble.com/Pacemaker-Resource-dont-get-started-on-the-standby-node-tp14686p14695.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker - Resource dont get started on the standby node.
Hello Parkirat can you share with us what was the problem? maybe this can help others persons Thanks 2013/6/16 Parkirat parkiratba...@gmail.com I figured out the problem. Thanks and Regards, Parkirat Singh Bagga. -- View this message in context: http://linux-ha.996297.n3.nabble.com/Pacemaker-Resource-dont-get-started-on-the-standby-node-tp14686p14691.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start should be group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd_fs_after_drbd inf: ma-ms-drbd5:promote astorage:start 2013/6/6 Thomas Glanzmann tho...@glanzmann.de Hello, on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When putting multiple filesystems which depend on multiple drbd promotions, only the first drbd is promoted and the group never comes up. However when the promotions are not based on the individual filesystems but on the group or probably any single entity all drbds are promoted correctly. So to summarize: This only promotes the first drbd and the resource group never starts: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start # ~~ This works: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start # ~~ I would like to know if that is supposed to happen. If that is the case I would understand why this is the case. I assume it is a bug, but I'm not sure. Complete working config here: primitive astorage_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \ op monitor interval=60s primitive astorage1-fencing stonith:external/ipmi \ params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage2-fencing stonith:external/ipmi \ params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage_16_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \ op monitor interval=60s primitive drbd10 ocf:linbit:drbd \ params drbd_resource=r10 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd10_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \ op monitor interval=60s primitive drbd3 ocf:linbit:drbd \ params drbd_resource=r3 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd4 ocf:linbit:drbd \ params drbd_resource=r4 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5 ocf:linbit:drbd \ params drbd_resource=r5 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages fstype=ext3 \ op monitor interval=60s primitive drbd6 ocf:linbit:drbd \ params drbd_resource=r6 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8 ocf:linbit:drbd \ params drbd_resource=r8 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \ op monitor interval=60s primitive drbd9 ocf:linbit:drbd \ params drbd_resource=r9 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd9_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd9 directory=/exports fstype=ext4 \ op monitor interval=60s primitive nfs-common ocf:heartbeat:nfs-common \ op monitor 
interval=60s primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \ op monitor interval=60s primitive target ocf:heartbeat:target \ op monitor interval=60s group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common nfs-kernel-server astorage_ip astorage_16_ip target \ meta target-role=Started ms ma-ms-drbd10 drbd10 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd3 drbd3 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd4 drbd4 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd5 drbd5 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd6 drbd6 \ meta master-max=1
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
sorry it should be group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd_fs_after_drbd inf: ma-ms-drbd5:promote ma-ms-drbd8:promote astorage:start 2013/6/6 emmanuel segura emi2f...@gmail.com group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start should be group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd_fs_after_drbd inf: ma-ms-drbd5:promote astorage:start 2013/6/6 Thomas Glanzmann tho...@glanzmann.de Hello, on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When putting multiple filesystems which depend on multiple drbd promotions, only the first drbd is promoted and the group never comes up. However when the promotions are not based on the individual filesystems but on the group or probably any single entity all drbds are promoted correctly. So to summarize: This only promotes the first drbd and the resource group never starts: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start # ~~ This works: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start # ~~ I would like to know if that is supposed to happen. If that is the case I would understand why this is the case. I assume it is a bug, but I'm not sure. Complete working config here: primitive astorage_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \ op monitor interval=60s primitive astorage1-fencing stonith:external/ipmi \ params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage2-fencing stonith:external/ipmi \ params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage_16_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \ op monitor interval=60s primitive drbd10 ocf:linbit:drbd \ params drbd_resource=r10 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd10_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \ op monitor interval=60s primitive drbd3 ocf:linbit:drbd \ params drbd_resource=r3 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd4 ocf:linbit:drbd \ params drbd_resource=r4 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5 ocf:linbit:drbd \ params drbd_resource=r5 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages fstype=ext3 \ op monitor interval=60s primitive drbd6 ocf:linbit:drbd \ params drbd_resource=r6 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8 ocf:linbit:drbd \ params drbd_resource=r8 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \ op monitor interval=60s primitive drbd9 ocf:linbit:drbd \ params drbd_resource=r9 \ op monitor interval=29s role=Master \ 
op monitor interval=31s role=Slave primitive drbd9_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd9 directory=/exports fstype=ext4 \ op monitor interval=60s primitive nfs-common ocf:heartbeat:nfs-common \ op monitor interval=60s primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \ op monitor interval=60s primitive target ocf:heartbeat:target \ op monitor interval=60s group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common nfs-kernel-server astorage_ip astorage_16_ip target \ meta target-role=Started ms ma-ms-drbd10 drbd10 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd3 drbd3 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd4 drbd4 \ meta master-max=1 master-node-max=1 clone-max=2 clone
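Written out in crmsh syntax, the corrected idea is a single resource-set order that promotes both masters before the group starts; colocation constraints keeping the group with the masters would normally accompany it (a sketch based on the names in the posted config):

    order o_drbd_before_group inf: ma-ms-drbd5:promote ma-ms-drbd8:promote astorage:start
    colocation c_group_with_drbd5 inf: astorage ma-ms-drbd5:Master
    colocation c_group_with_drbd8 inf: astorage ma-ms-drbd8:Master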
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
Hello Thomas Sorry i can't give you any explain, because i don't see any sense in your config Sorry 2013/6/6 Thomas Glanzmann tho...@glanzmann.de Hello, on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When putting multiple filesystems which depend on multiple drbd promotions, only the first drbd is promoted and the group never comes up. However when the promotions are not based on the individual filesystems but on the group or probably any single entity all drbds are promoted correctly. So to summarize: This only promotes the first drbd and the resource group never starts: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start # ~~ This works: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start # ~~ I would like to know if that is supposed to happen. If that is the case I would understand why this is the case. I assume it is a bug, but I'm not sure. Complete working config here: primitive astorage_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \ op monitor interval=60s primitive astorage1-fencing stonith:external/ipmi \ params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage2-fencing stonith:external/ipmi \ params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage_16_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \ op monitor interval=60s primitive drbd10 ocf:linbit:drbd \ params drbd_resource=r10 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd10_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \ op monitor interval=60s primitive drbd3 ocf:linbit:drbd \ params drbd_resource=r3 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd4 ocf:linbit:drbd \ params drbd_resource=r4 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5 ocf:linbit:drbd \ params drbd_resource=r5 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages fstype=ext3 \ op monitor interval=60s primitive drbd6 ocf:linbit:drbd \ params drbd_resource=r6 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8 ocf:linbit:drbd \ params drbd_resource=r8 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \ op monitor interval=60s primitive drbd9 ocf:linbit:drbd \ params drbd_resource=r9 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd9_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd9 directory=/exports fstype=ext4 \ op monitor interval=60s primitive nfs-common ocf:heartbeat:nfs-common \ op monitor interval=60s primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \ op monitor interval=60s primitive target ocf:heartbeat:target \ op monitor interval=60s group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common nfs-kernel-server astorage_ip 
astorage_16_ip target \ meta target-role=Started ms ma-ms-drbd10 drbd10 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd3 drbd3 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd4 drbd4 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd5 drbd5 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd6 drbd6 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd8 drbd8 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd9 drbd9 \ meta master-max=1
Re: [Linux-HA] Network failover and communication channel survival
maybe you can use openvswitch 2013/4/30 Lang, David david_l...@intuit.com I've thought about this for a few years, but have not yet implemented it. What I would look at is setting up a new virtual network that trunks your two physical networks together and you can then use the IP on that trunk for your communication. David Lang Richard Comblen richard.comb...@kreios.lu wrote: Hi all, I have a two node setup with replicated PostgreSQL DB (master/slave setup). Focus is on keeping the system up and running, not on capacity. All good, that works fine. Now, a new requirement shows up: the two nodes should be connected using two physically separated networks, and should survive failure of one of the two networks. The two nodes communicate together for PostgreSQL replication. Initially, the slave will communicate with the master on network 1, and if network 1 fails, it should switch to network 2. Obviously, I cannot use a virtual-ip over two different networks. What would be the closest solution in term of features ? I was thinking about having a resource agent managing a port forward rule on slave node, something like localhost:5433 = master_ip_on_network1:5432, that would switch to localhost:5433 = master_ip_on_network2:5432, so that it would be transparent for the replication tool. Do you know if such a resource agent is already implemented somewhere ? Do you have remarks, comments about such a setup ? Do you have suggestion on a better way to achieve these requirements ? Thanks, Richard ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
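As a very rough sketch of the virtual network / trunking idea with Open vSwitch (interface names and the address are invented, and whether a bond across two physically separate networks fits this topology would need testing):

    ovs-vsctl add-br br0
    ovs-vsctl add-bond br0 bond0 eth1 eth2
    ovs-vsctl set port bond0 bond_mode=active-backup
    ip addr add 10.0.0.1/24 dev br0
    ip link set br0 up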
Re: [Linux-HA] Is CLVM really needed in an active/passive cluster?
Hello Angel In this thread http://comments.gmane.org/gmane.linux.redhat.release.rhel5/6395 you can find the answer to your question Thanks 2013/4/22 Angel L. Mateo ama...@um.es Hello, I'm deploying a clustered pop/imap server with mailboxes stored in a SAN connected with fibre channel. The problem I have is that I have firstly configured the cluster with CLVM, but with this I can't create snapshots of my volumes, which is required for backups. But is this CLVM really necessary? Or it is enough to configure LVM with fencing and stonith? -- Angel L. Mateo Martínez Sección de Telemática Área de Tecnologías de la Información y las Comunicaciones Aplicadas (ATICA) http://www.um.es/atica Tfo: 868889150 Fax: 86337 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
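For what it is worth, active/passive setups are often run without cLVM: Pacemaker activates a plain (non-clustered) VG on one node at a time, with stonith configured, which keeps LVM snapshots usable. A minimal sketch with a placeholder VG name:

    primitive p_lvm_mail ocf:heartbeat:LVM \
        params volgrp="vg_mail" \
        op monitor interval="60s"
    # also keep the VG out of boot-time auto-activation, e.g. via the
    # volume_list filter in /etc/lvm/lvm.conf on both nodes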
Re: [Linux-HA] Fwd: stonith with sbd not working
create a partition on /dev/sdd and you that 2013/4/9 Fredrik Hudner fredrik.hud...@gmail.com Hi, I have a (for now) two node HA cluster with sbd as stonith mechanism. I have followed the installation and configuration of sbd from http://www.linux-ha.org/wiki/SBD_Fencing. For one reason or another stonith won't start and messages log says at one point: stonith-ng[20383]: notice: stonith_device_action: Device stonith_sbd not found. stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-12 stonith-ng[30234]: info: stonith_device_register: Added 'stonith_sbd' to the device list (1 active devices) stonith-ng[30234]: info: stonith_command: Processed st_device_register from lrmd: rc=0 stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-1 stonith-ng[30234]: notice: log_operation: Operation 'monitor' [30502] for device 'stonith_sbd' returned: -2 stonith-ng[30234]: info: stonith_device_remove: Removed 'stonith_sbd' from the device list (0 active devices) Maybe it doesn't find the sbd device /dev/sdd ? From the output of my logs, you can see that both machines see each others sbd device but stonith doesn't seem to recognize the device. Question is why Attached are all possible logs and config files from drbd, sbd, pacemaker. Corosync I can send if you need me too, but the post became to big with that included Kind regards /Fredrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Fwd: stonith with sbd not working
Sorry create a partition on /dev/sdd and you use that 2013/4/9 emmanuel segura emi2f...@gmail.com create a partition on /dev/sdd and you that 2013/4/9 Fredrik Hudner fredrik.hud...@gmail.com Hi, I have a (for now) two node HA cluster with sbd as stonith mechanism. I have followed the installation and configuration of sbd from http://www.linux-ha.org/wiki/SBD_Fencing. For one reason or another stonith won't start and messages log says at one point: stonith-ng[20383]: notice: stonith_device_action: Device stonith_sbd not found. stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-12 stonith-ng[30234]: info: stonith_device_register: Added 'stonith_sbd' to the device list (1 active devices) stonith-ng[30234]: info: stonith_command: Processed st_device_register from lrmd: rc=0 stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-1 stonith-ng[30234]: notice: log_operation: Operation 'monitor' [30502] for device 'stonith_sbd' returned: -2 stonith-ng[30234]: info: stonith_device_remove: Removed 'stonith_sbd' from the device list (0 active devices) Maybe it doesn't find the sbd device /dev/sdd ? From the output of my logs, you can see that both machines see each others sbd device but stonith doesn't seem to recognize the device. Question is why Attached are all possible logs and config files from drbd, sbd, pacemaker. Corosync I can send if you need me too, but the post became to big with that included Kind regards /Fredrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
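Spelled out, that suggestion might look like the following (partition size and the sysconfig path are assumptions; SBD only needs a few megabytes):

    parted -s /dev/sdd mklabel msdos mkpart primary 1MiB 9MiB   # wipes /dev/sdd
    sbd -d /dev/sdd1 create        # write the SBD header
    sbd -d /dev/sdd1 dump          # verify the stored timeouts
    echo 'SBD_DEVICE="/dev/sdd1"' >> /etc/sysconfig/sbd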
Re: [Linux-HA] Q: corosync/TOTEM retransmit list
Try look here http://www.hastexo.com/resources/hints-and-kinks/whats-totem-retransmit-list-all-about-corosync 2013/4/3 Ulrich Windl ulrich.wi...@rz.uni-regensburg.de Hi! I have a simple question: Is it possible that DLM or OCFS2 causes corosync/TOTEM retransmit messages? I have the feeling that whenever OCFS2 is busy, corosync/TOTEM sends out retransmit lists like this: [...] Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6940d Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69410 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69413 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69415 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69417 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69419 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6941b Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6941d Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69420 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69422 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69424 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69426 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69428 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6942a Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6942c Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6942e Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69430 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69432 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69436 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69439 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6943c Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6943e Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69440 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69442 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69444 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69446 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69448 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6944a Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6944c Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6944e Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69450 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69452 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69454 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69456 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Marking ringid 1 interface 10.2.2.1 FAULTY Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69458 Apr 3 11:30:02 h01 corosync[4310]: [TOTEM ] Automatically recovered ring 1 Apr 3 11:30:02 h01 corosync[4310]: [TOTEM ] Automatically recovered ring 1 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69484 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69486 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69486 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69486 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69487 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948a Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit 
List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b 6948d Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69492 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69494 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69496 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69499 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 6949b Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 6949d Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 6949f Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a3 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a5 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a7 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a9 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694ab Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694ad Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694af Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694b1 Apr 3
Re: [Linux-HA] Heartbeat IPv6addr OCF
Hello Nick Try to use nic=eth0 instead of nic=eth0:3 thanks 2013/3/24 Nick Walke tubaguy50...@gmail.com Thanks for the tip, however, it did not work. That's actually a /116. So I put in 2600:3c00::0034:c007/116 and am getting the same error. I requested that it restart the resource as well, just to make sure it wasn't the previous error. Nick On Sun, Mar 24, 2013 at 3:55 AM, Thomas Glanzmann tho...@glanzmann.de wrote: Hello, ipv6addr=2600:3c00::0034:c007 from the manpage of ocf_heartbeat_IPv6addr it looks like that you have to specify the netmask so try: ipv6addr=2600:3c00::0034:c007/64 assuiming that you're in a /64. Cheers, Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
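Putting the two hints together (the /116 prefix and a plain nic=eth0), the primitive would look roughly like:

    primitive p_ip6 ocf:heartbeat:IPv6addr \
        params ipv6addr="2600:3c00::0034:c007" cidr_netmask="116" nic="eth0" \
        op monitor interval="10s"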
Re: [Linux-HA] Problem promoting Slave to Master
Hello Fedrik Why you have a clone of cl_exportfs_root and you have ext4 filesystem, and i think this order is not correct order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start order o_root_before_nfs inf: cl_exportfs_root g_nfs:start I think like that you try to start g_nfs twice 2013/3/14 Fredrik Hudner fredrik.hud...@evry.com Hi all, I have a problem after I removed a node with the force command from my crm config. Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6, pacemaker 1.1.7-6.el6) Then I wanted to add a third node acting as quorum node, but was not able to get it to work (probably because I don't understand how to set it up). So I removed the 3rd node, but had to use the force command as crm complained when I tried to remove it. Now when I start up Pacemaker the resources doesn't look like they come up correctly Online: [ testclu01 testclu02 ] Master/Slave Set: ms_drbd_nfs [p_drbd_nfs] Masters: [ testclu01 ] Slaves: [ testclu02 ] Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver] Started: [ tdtestclu01 tdtestclu02 ] Resource Group: g_nfs p_lvm_nfs (ocf::heartbeat:LVM): Started testclu01 p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01 p_fs_shared2 (ocf::heartbeat:Filesystem):Started testclu01 p_ip_nfs (ocf::heartbeat:IPaddr2): Started testclu01 Clone Set: cl_exportfs_root [p_exportfs_root] Started: [ testclu01 testclu02 ] Failed actions: p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7, status=complete): not running p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running The filesystems mount correctly on the master at this stage and can be written to. When I stop the services on the master node for it to failover, it doesn't work.. Looses cluster-ip connectivity Corosync.log from master after I stopped pacemaker on master node : see attached file Additional files (attached): crm-configure show Corosync.conf Global_common.conf I'm not sure how to proceed to get it up in a fair state now So if anyone could help me it would be much appreciated Kind regards /Fredrik Hudner ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
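If the double-start suspicion is right, one way to express the dependency only once is a single ordered chain (a sketch reusing the resource names from the posted config):

    order o_nfs_chain inf: ms_drbd_nfs:promote cl_exportfs_root:start g_nfs:start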
Re: [Linux-HA] Using a Ping Daemon (or Something Better) to Prevent Split Brain
I don't know if ping is rigth for your case, try to look here http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.geo.html 2013/1/31 Robinson, Eric eric.robin...@psmnv.com We have this configuration: NodeA is located in DataCenterA. NodeB is located in (geographically separate) DataCenterB. DataCenterA is connected to DataCenterB through 4 redundant gigabit links (two physically separate Corosync rings). Both nodes reach the Internet through (geographically separate) DataCenterC. Is it possible to prevent cluster partition (or drbd split brain) if the links between DataCenterA and DataCenterB go down, but at least one node can still communicate with DataCenterC? Note that we have no equipment at DataCenterC, but we can ping stuff in it and through it. Ideally, I would like to prevent a secondary node from going primary if it cannot communicate with DataCenterC. -- Eric Robinson Disclaimer - January 30, 2013 This email and any files transmitted with it are confidential and intended solely for linux-ha@lists.linux-ha.org. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physicians' Managed Care or Physician Select Management. Warning: Although Physicians' Managed Care or Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. This disclaimer was added by Policy Patrol: http://www.policypatrol.com/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
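A common building block for "only stay primary while DataCenterC is reachable" is the ping resource plus a location rule; a sketch (the probe addresses and the ms_drbd resource name are placeholders):

    primitive p_ping ocf:pacemaker:ping \
        params host_list="192.0.2.1 192.0.2.2" multiplier="1000" \
        op monitor interval="15s" timeout="60s"
    clone cl_ping p_ping
    location l_need_dcc ms_drbd \
        rule -inf: not_defined pingd or pingd lte 0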
Re: [Linux-HA] resources running on both node
1: check if the services are configured to start at boot time 2:without info nobody can help you 2012/7/21 Chirag Vaishnav chirag.vaish...@saicare.com Hi, We are HA between two nodes, everything is configured as per standard example file (using haresources) and everything works well generally, but some time we face issue that HA is up and running on both the nodes with resources (IP) allocated on both node. Ideally in such case one of the node should leave the resources but it doesn't happen. Any suggestions would be helpful. Thanks Regards, Chirag Vaishnav Sai Infosystem (India) Ltd. Corp. Office: Sai Care Super Plaza, Sandesh press road, Vastrapur Ahmedabad . 380054. Gujarat, India Phone: +91-79-30110400 Url: www.saicare.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
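For point 1, on a haresources-style setup the check is simply that nothing listed in haresources is also started by init; for example (httpd stands in for whatever service is in haresources):

    chkconfig --list heartbeat    # should be on
    chkconfig --list httpd        # services managed by haresources should be off
    chkconfig httpd off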
Re: [Linux-HA] DRBD and automatic sync
are you using ext3 for drbd active/active? UM 2012/8/3 Elvis Altherr elvis.alth...@gmail.com Hello together On my gentoo servers (2 Node Cluster with kernel 3.x) i use heartbeat 3.0.5 and DRBD 8.4.0 for block replication between the two machines which served apache, mysql and samba fileservices Everything works fine, except the automatic sync between the two drives wich are both primarys What did i wrong? conf files see below drbd.conf resource r0 { # protocol to use; C is the the safest variant net { allow-two-primaries; } protocol C; startup { become-primary-on both; #timeout (in seconds) for the connection on startup wfc-timeout 90; # timeout (in seconds) for the connection on startup #after detection of data inconsistencies (degraded mode) degr-wfc-timeout 120; } syncer { # maximum bandwidth to use for this resource rate 100M; } on mail2 { ### options for master-server ### # name of the allocated blockdevice device /dev/drbd0; # underlying blockdevice disk /dev/sdb1; #address and port to use for the synchronisation # here we use the heartbeat network address10.0.0.1:7788; # where to store DRBD metadata; here it's on the underlying device itself meta-disk internal; } on disthost3 { device /dev/drbd1; disk /dev/sda6; address 10.0.0.2:7788; meta-disk internal; } haresoures file for heartbeat mail2 10.0.0.3 drbddisk::r0 Filesystem::/dev/drbd0::/drfs::ext3 apache2 mysql bind samba ha.cf # Logging debug 1 use_logd true logfacility daemon # Misc Options traditional_compression off compression bz2 coredumps true auto_failback on # Communications udpport 694 #ucast eth1 10.0.0.1 bcast eth1 #autojoin any # Thresholds (in seconds) keepalive 2 warntime5 deadtime15 initdead60 crm no nodemail2 nodedisthost3 ~ thanks for your help -- Freundliche Grüsse Elvis Altherr Brauerstrasse 83a 9016 St. Gallen 071 280 13 79 (Privat) elvis.alth...@gmail.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD and automatic sync
i know the drbd primary to primary it's for use ocfs/gfs, so for have the filesystem read write on both nodes, why you still using heartbeat 1.X 2012/8/3 Elvis Altherr elvis.alth...@gmail.com Am 03.08.2012 09:32, schrieb emmanuel segura: are you using ext3 for drbd active/active? UM 2012/8/3 Elvis Altherr elvis.alth...@gmail.com Hello together On my gentoo servers (2 Node Cluster with kernel 3.x) i use heartbeat 3.0.5 and DRBD 8.4.0 for block replication between the two machines which served apache, mysql and samba fileservices Everything works fine, except the automatic sync between the two drives wich are both primarys What did i wrong? conf files see below drbd.conf resource r0 { # protocol to use; C is the the safest variant net { allow-two-primaries; } protocol C; startup { become-primary-on both; #timeout (in seconds) for the connection on startup wfc-timeout 90; # timeout (in seconds) for the connection on startup #after detection of data inconsistencies (degraded mode) degr-wfc-timeout 120; } syncer { # maximum bandwidth to use for this resource rate 100M; } on mail2 { ### options for master-server ### # name of the allocated blockdevice device /dev/drbd0; # underlying blockdevice disk /dev/sdb1; #address and port to use for the synchronisation # here we use the heartbeat network address10.0.0.1:7788; # where to store DRBD metadata; here it's on the underlying device itself meta-disk internal; } on disthost3 { device /dev/drbd1; disk /dev/sda6; address 10.0.0.2:7788; meta-disk internal; } haresoures file for heartbeat mail2 10.0.0.3 drbddisk::r0 Filesystem::/dev/drbd0::/drfs::ext3 apache2 mysql bind samba ha.cf # Logging debug 1 use_logd true logfacility daemon # Misc Options traditional_compression off compression bz2 coredumps true auto_failback on # Communications udpport 694 #ucast eth1 10.0.0.1 bcast eth1 #autojoin any # Thresholds (in seconds) keepalive 2 warntime5 deadtime15 initdead60 crm no nodemail2 nodedisthost3 ~ thanks for your help -- Freundliche Grüsse Elvis Altherr Brauerstrasse 83a 9016 St. Gallen 071 280 13 79 (Privat) elvis.alth...@gmail.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems yes.. woud i better use GFS2 or OFCS (which both dosen't work under kernel 3.x) ? Or which is the best file system porpouse for my case? -- Freundliche Grüsse Elvis Altherr Brauerstrasse 83a 9016 St. Gallen 071 280 13 79 (Privat) elvis.alth...@gmail.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
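If the goal really is one filesystem mounted read-write on both primaries, it has to be a cluster filesystem; with OCFS2 the formatting step would look like this (the O2CB cluster membership layer must already be configured, which is the involved part on a heartbeat v1 setup):

    mkfs.ocfs2 -N 2 -L drfs /dev/drbd0
    mount -t ocfs2 /dev/drbd0 /drfs     # run the mount on both nodes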
Re: [Linux-HA] mount.ocfs2 in D state
Do you have a stonith configured? 2012/7/2 EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com Hi, when a split brain (drbd) happens mount.ocfs2 remains hanging unkillable in D-state. rt-lxcl9a:~ # ps aux | grep ocf root 347 0.0 0.0 10468 740 ?D10:25 0:00 /sbin/mount.ocfs2 /dev/drbd_r0 /SHARED -o rw root 349 0.0 0.0 0 0 ?S10:25 0:00 [ocfs2dc] root 5906 0.0 0.0 11552 1796 ?S10:36 0:00 /bin/sh /usr/lib/ocf/resource.d//heartbeat/Filesystem stop root 32715 0.0 0.0 0 0 ?S 10:25 0:00 [ocfs2_wq] root 32717 0.0 0.0 90776 2120 ?Ss 10:25 0:00 /usr/sbin/ocfs2_controld.pcmk As a consequence I do not know how to resolve the split brain situation as I cannot demote anything anymore. Is this a known bug? Best regards Martin Konold Robert Bosch GmbH Automotive Electronics Postfach 13 42 72703 Reutlingen GERMANY www.bosch.com Tel. +49 7121 35 3322 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Franz Fehrenbach; Geschäftsführung: Volkmar Denner, Siegfried Dais; Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Dirk Hoheisel, Christoph Kübel, Uwe Raschke, Wolf-Henning Scheider, Werner Struth, Peter Tyroller ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] mount.ocfs2 in D state
remove the standby on node node rt-lxcl9a 2012/7/2 EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com Hi, Do you have a stonith configured? Yes. Though a hanging mount does not cause stonith to become activated. node rt-lxcl9a \ attributes standby=on node rt-lxcl9b \ attributes standby=off primitive dlm ocf:pacemaker:controld \ op monitor interval=60 timeout=60 \ meta target-role=Stopped primitive ip-rt-lxlr9a ocf:heartbeat:IPaddr \ params ip=10.13.132.94 cidr_netmask=255.255.252.0 \ op monitor interval=5s timeout=20s depth=0 \ meta target-role=Started primitive ip-rt-lxlr9b ocf:heartbeat:IPaddr \ params ip=10.13.132.95 cidr_netmask=255.255.252.0 \ op monitor interval=5s timeout=20s depth=0 \ meta target-role=Started primitive o2cb ocf:ocfs2:o2cb \ op monitor interval=60 timeout=60 \ meta target-role=Stopped primitive resDRBD ocf:linbit:drbd \ params drbd_resource=r0 \ operations $id=resDRBD-operations \ op monitor interval=20 role=Master timeout=20 \ op monitor interval=30 role=Slave timeout=20 \ meta target-role=Stopped primitive resource-fs ocf:heartbeat:Filesystem \ params device=/dev/drbd_r0 directory=/SHARED fstype=ocfs2 \ op monitor interval=120s \ meta target-role=Stopped primitive stonith-ilo-rt-lxcl9ar stonith:external/ipmi \ params hostname=rt-lxcl9a ipaddr=10.13.172.85 userid=stonith passwd=stonithstonith passwd_method=param interface=lanplus pcmk_host_check=static-list pcmk_host_list=rt-lxcl9a \ meta target-role=Started primitive stonith-ilo-rt-lxcl9br stonith:external/ipmi \ params hostname=rt-lxcl9b ipaddr=10.13.172.93 userid=stonith passwd=stonithstonith passwd_method=param interface=lanplus pcmk_host_check=static-list pcmk_host_list=rt-lxcl9b \ meta target-role=Started ms msDRBD resDRBD \ meta resource-stickines=100 notify=true master-max=2 interleave=true target-role=Started clone clone-dlm dlm \ meta globally-unique=false interleave=true target-role=Started clone clone-fs resource-fs \ meta interleave=true ordered=true target-role=Started clone clone-ocb o2cb \ meta globally-unique=false interleave=true target-role=Stopped location location-stonith-ilo-rt-lxcl9ar stonith-ilo-rt-lxcl9ar -inf: rt-lxcl9a location location-stonith-ilo-rt-lxcl9br stonith-ilo-rt-lxcl9br -inf: rt-lxcl9b colocation colocation-dlm-drbd inf: clone-dlm msDRBD:Master colocation colocation-fs-o2cb inf: clone-fs clone-ocb colocation colocation-ocation-dlm inf: clone-ocb clone-dlm order order-dlm-o2cb 0: clone-dlm clone-ocb order order-drbd-dlm 0: msDRBD:promote clone-dlm:start order order-o2cb-fs 0: clone-ocb clone-fs property $id=cib-bootstrap-options \ stonith-enabled=true \ no-quorum-policy=ignore \ placement-strategy=balanced \ dc-version=1.1.6-b988976485d15cb702c9307df55512d323831a5e \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ last-lrm-refresh=1341218156 \ stonith-timeout=30s \ maintenance-mode=true rsc_defaults $id=rsc-options \ resource-stickiness=200 \ migration-threshold=3 op_defaults $id=op-options \ timeout=600 \ record-pending=true Yours, -- martin ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
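Clearing the standby attribute can be done from either node, for example:

    crm node online rt-lxcl9a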
Re: [Linux-HA] ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop
First of all the parameter 201 it must be diferent for every resource 2012/6/19 Martin Marji Cermak cerm...@gmail.com Hello guys, I have 3 questions if you please. I have a HA NFS cluster - Centos 6.2, pacemaker, corosync, two NFS nodes plus 1 quorum node, in semi Active-Active configuration. By semi, I mean that both NFS nodes are active and each of them is under normal circumstances exclusively responsible for one (out of two) Volume Group - using the ocf:heartbeat:LVM RA. Each LVM volume group lives on a dedicated multipath iscsi device, exported from a shared SAN. I'm exporting a NFSv3/v4 export (/srv/nfs/software_repos directory). I need to make it available for 2 separate /21 networks as read-only, and for 3 different servers as read-write. I'm using the ocf:heartbeat:exportfs RA and it seems to me I have to use the ocf:heartbeat:exportfs RA 5 times. The configuration (only IP addresses changed) is here: http://pastebin.com/eHkgUv64 1) is there a way how to export this directory 5 times without defining 5 ocf:heartbeat:exportfs primitives? It's a lot of duplications... I search all the forums and I fear the ocf:heartbeat:exportfs simply supports only one host / network range. But maybe someone has been working on a patch? 2) while using the ocf:heartbeat:exportfs 5 times for the same directory, do I have to use the _same_ FSID (201 in my config) for all these 5 primitives (as Im exporting the _same_ filesystem / directory)? I'm getting this warning when doing so WARNING: Resources p_exportfs_software_repos_ae1,p_exportfs_software_repos_ae2,p_exportfs_software_repos_buller,p_exportfs_software_repos_iap-mgmt,p_exportfs_software_repos_youyangs violate uniqueness for parameter fsid: 201 Do you still want to commit? 3) wait_for_leasetime_on_stop - I believe this must be set to true when exporting NFSv4 with ocf:heartbeat:exportfs. http://www.linux-ha.org/doc/man-pages/re-ra-exportfs.html My 5 exportfs primitives reside in the same group: group g_nas02 p_lvm02 p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt p_ip02 \ meta resource-stickiness=101 Even though I have the /proc/fs/nfsd/nfsv4gracetime set to 10 seconds, a failover of the NFS group from one NFS node to the second node would take more than 50 seconds, as it will be waiting for each ocf:heartbeat:exportfs resource sleeping 10 seconds 5 times. Is there any way of making them fail over / sleeping in parallel, instead of sequential? I workarounded this by setting wait_for_leasetime_on_stop=true for only one of these (which I believe is safe and does the job it is expected to do - please correct me if I'm wrong). Thank you for your valuable comments. 
My Pacemaker configuration: http://pastebin.com/eHkgUv64

[root@irvine ~]# facter | egrep 'lsbdistid|lsbdistrelease'
lsbdistid = CentOS
lsbdistrelease = 6.2
[root@irvine ~]# rpm -qa | egrep 'pacemaker|corosync|agents'
corosync-1.4.1-4.el6_2.2.x86_64
pacemaker-cli-1.1.6-3.el6.x86_64
pacemaker-libs-1.1.6-3.el6.x86_64
corosynclib-1.4.1-4.el6_2.2.x86_64
pacemaker-cluster-libs-1.1.6-3.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
fence-agents-3.1.5-10.el6_2.2.x86_64
resource-agents-3.9.2-7.el6.x86_64

with /usr/lib/ocf/resource.d/heartbeat/exportfs updated by hand from:
https://github.com/ClusterLabs/resource-agents/commits/master/heartbeat/exportfs

Thank you very much
Marji Cermak

--
esta es mi vida e me la vivo hasta que dios quiera
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
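For illustration, this is roughly what the fsid advice at the top of this thread amounts to (only two of the five primitives shown; the clientspec networks are placeholders, since the real addresses were edited out of the pastebin, and only the fsid values differ):

primitive p_exportfs_software_repos_ae1 ocf:heartbeat:exportfs \
        params directory=/srv/nfs/software_repos clientspec=10.10.0.0/21 options=ro fsid=201 \
        op monitor interval=30s
primitive p_exportfs_software_repos_ae2 ocf:heartbeat:exportfs \
        params directory=/srv/nfs/software_repos clientspec=10.10.8.0/21 options=ro fsid=202 \
        op monitor interval=30s

Whether wait_for_leasetime_on_stop needs to be set on every primitive or only on one, as in the workaround described above, remains the open question of this thread.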
Re: [Linux-HA] Active/Active Cluster
Why are you using cman and corosync together? I think you should use either cman+pacemaker or corosync+pacemaker.

2012/6/9 Yount, William D yount.will...@menloworldwide.com:

I have two servers which are both Dell 990s. Each server has two 1 TB hard drives configured in RAID0. I have installed CentOS on both and they have the same partition sizes. I am using /dev/KNTCLFS00X/Storage as a drbd partition and attaching it to /dev/drbd0. DRBD syncing appears to be working fine.

I am trying to set up an Active/Active cluster. I have set up CMAN. I want to use this cluster just for NFS storage. I want to have these services running on both nodes at the same time:
* IP Address
* DRBD
* Filesystem (gfs2)

Through a combination of official documentation and LCMC, I have this set up. However, I am getting this:

Last updated: Fri Jun 8 23:11:58 2012
Last change: Fri Jun 8 23:11:37 2012 via crmd on KNTCLFS002
Stack: cman
Current DC: KNTCLFS002 - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
6 Resources configured.

Online: [ KNTCLFS001 KNTCLFS002 ]

Clone Set: cl_IPaddr2_1 [res_IPaddr2_1]
    Started: [ KNTCLFS002 KNTCLFS001 ]
Master/Slave Set: ms_drbd_2 [res_drbd_2]
    Masters: [ KNTCLFS002 ]
    Stopped: [ res_drbd_2:0 ]
Clone Set: cl_Filesystem_1 [res_Filesystem_1] (unique)
    res_Filesystem_1:0 (ocf::heartbeat:Filesystem): Stopped
    res_Filesystem_1:1 (ocf::heartbeat:Filesystem): Started KNTCLFS002

Failed actions:
    res_Filesystem_1:0_start_0 (node=KNTCLFS002, call=42, rc=1, status=complete): unknown error
    res_Filesystem_1:0_start_0 (node=KNTCLFS001, call=87, rc=1, status=complete): unknown error

I have attached my nfs.res, cluster.conf and corosync.conf files. Please let me know if I can provide any other information to help resolve this.

Thanks,
William Yount | Systems Analyst | Menlo Worldwide | Cell: 901-654-9933
Safety | Leadership | Integrity | Commitment | Excellence
Please consider the environment before printing this e-mail

--
esta es mi vida e me la vivo hasta que dios quiera
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
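For comparison, a minimal sketch of the cman+pacemaker variant suggested above, assuming the stock CentOS 6 init scripts (cman brings up corosync itself from cluster.conf, so no standalone corosync service should be running alongside it):

        chkconfig corosync off
        chkconfig cman on
        chkconfig pacemaker on
        service cman start
        service pacemaker start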
Re: [Linux-HA] problem with nfs and exportfs failover
Maybe the problem it's the primitive nfsserver lsb:nfs-kernel-server, i think this primitive was stoped befoure exportfs-admin ocf:heartbeat:exportfs And if i rember the lsb:nfs-kernel-server and exportfs agent does the same thing the first use the os scripts and the second the cluster agents Il giorno 14 aprile 2012 01:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 4/13/12 7:18 PM, William Seligman wrote: On 4/13/12 6:42 PM, Seth Galitzer wrote: In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. crm config show is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited. It's clear that your exportfs resource is timing out for the admin resource. I'm no expert, but here are some stupid exportfs tricks to try: - Check your /etc/exports file (or whatever the equivalent is in Debian; man exportfs will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts. - Take out the exportfs-admin resource. Then try doing things manually: # exportfs x.x.x.0/24:/exports/admin Assuming that works, then look at the output of just # exportfs The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match. If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource. Wait a second; I completely forgot about this thread that I started: http://www.gossamer-threads.com/lists/linuxha/users/78585 The solution turned out to be to remove the .rmtab files from the directories I was exporting, deleting touching /var/lib/nfs/rmtab (you'll have to look up the Debian location), and adding rmtab_backup=none to all my exportfs resources. Hopefully there's a solution for you in there somewhere! On 04/13/2012 01:51 PM, William Seligman wrote: On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested. Yes, please post the output of crm configure show, the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes filling up with walls of text. In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.) The problem and patch discussed in those links doesn't quite match what you describe. 
I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup. -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
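For reference, a rough sketch of an exportfs primitive with the rmtab_backup workaround mentioned above (directory, clientspec and resource name are taken from this thread; the fsid and options values are placeholders):

primitive exportfs-admin ocf:heartbeat:exportfs \
        params directory=/exports/admin clientspec=x.x.x.0/24 \
               options=rw fsid=1 rmtab_backup=none \
        op monitor interval=30s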
Re: [Linux-HA] Problem with stonith RA using external/ipmi over lan or lanplus interface
If I remember well, to use an iLO 3 card you should use the ipmilan cluster agent.

On 11 April 2012 23:00, Pham, Tom tom.p...@viasat.com wrote:

Hi everyone,

I am trying to test a two-node cluster with a stonith resource using external/ipmi (I tried external/riloe first but it does not seem to work). My system is an HP ProLiant BL460c G7 with an iLO 3 card, firmware 1.15, SUSE 11, corosync 1.2.7, Pacemaker 1.0.9.

When I use the lan or lanplus interface, it fails to start the stonith resource. I get the error below:

external/ipmi[12173]: [12184]: ERROR: error executing ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session Unable to get Chassis Power Status

However, when I used interface=open instead of lan/lanplus, the stonith resource started fine. When I tried kill -9 on corosync on node1, I expected it to reboot node1 and start all resources on node2. But it rebooted node1. Someone mentioned that the open interface is a local interface and only allows a node to fence itself.

Does anyone know why lan/lanplus does not work?

Thanks
Tom Pham

--
esta es mi vida e me la vivo hasta que dios quiera
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
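Before changing agents it may be worth ruling out the IPMI-over-LAN setup itself. A minimal check with plain ipmitool, run from the peer node with the same interface the stonith plugin would use (address and credentials are placeholders):

        # IPMI v2.0 / RMCP+ session, i.e. what interface=lanplus uses
        ipmitool -I lanplus -H <ilo3-address> -U <user> -P <password> chassis power status
        # IPMI v1.5 session, i.e. what interface=lan uses
        ipmitool -I lan -H <ilo3-address> -U <user> -P <password> chassis power status

If these fail with the same RMCP+ error, the problem is on the iLO/network side rather than in the stonith plugin.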
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
William But i would like to know if you have a lvm resource in your pacemaker configuration Remember clvmd it's not for active di vg or lv it's for propagate the lvm meta data on all node of the cluster Il giorno 26 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/26/12 4:28 PM, emmanuel segura wrote: Sorry Willian i can't post my config now because i'm at home now not in my job I think it's no a problem if clvm start before drbd, because clvm not needed and devices to start This it's the point, i hope to be clear The introduction of pacemaker in redhat cluster was thinked for replace rgmanager not whole cluster stack and i suggest you to start clvmd at boot time chkconfig clvmd on I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get: Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED] ... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource. Sorry for my bad english :-) i can from a spanish country and all days i speak Italian I'm sorry that I don't speak more languages! You're the one who's helping me; it's my task to learn and understand. Certainly your English is better than my French or Russian. Il giorno 26 marzo 2012 22:04, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/26/12 3:48 PM, emmanuel segura wrote: I know it's normal fence_node doesn't work because the request of fence must be redirect to pacemaker stonith I think call the cluster agents with rgmanager it's really ugly thing, i never seen a cluster like this == If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. == clvm doesn't need to be called from rgmanger in the cluster configuration this the boot sequence of redhat daemons 1:cman, 2:clvm, 3:rgmanager and if you don't wanna use rgmanager you just replace rgmanager I'm sorry, but I don't think I understand what you're suggesting. Do you suggest that I start clvmd at boot? That won't work; clvmd won't see the volume groups on drbd until drbd is started and promoted to primary. May I ask you to post your own cluster.conf on pastebin.com so I can see how you do it? Along with crm configure show if that's relevant for your cluster? Il giorno 26 marzo 2012 19:21, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/24/12 5:40 PM, emmanuel segura wrote: I think it's better you use clvmd with cman I don't now why you use the lsb script of clvm On Redhat clvmd need of cman and you try to running with pacemaker, i not sure this is the problem but this type of configuration it's so strange I made it a virtual cluster with kvm and i not foud a problems While I appreciate the advice, it's not immediately clear that trying to eliminate pacemaker would do me any good. Perhaps someone can demonstrate the error in my reasoning: If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. If I tried to use cman instead of pacemaker, I'd be cutting myself off from the pacemaker features that cman/rgmanager does not yet have available, such as pacemaker's symlink, exportfs, and clonable IPaddr2 resources. 
I recognize I've got a strange problem. Given that fence_node doesn't work but stonith_admin does, I strongly suspect that the problem is caused by the behavior of my fencing agent, not the use of pacemaker versus rgmanager, nor by how clvmd is being started. Il giorno 24 marzo 2012 13:09, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/24/12 4:47 AM, emmanuel segura wrote: How do you configure clvmd? with cman or with pacemaker? Pacemaker. Here's the output of 'crm configure show': http://pastebin.com/426CdVwN Il giorno 23 marzo 2012 22:14, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry but i would to know if can show me your /etc/cluster/cluster.conf Here it is: http://pastebin.com/GUr0CEgZ Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman
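On the vgscan point above: for reference, a rough sketch of making the volume group visible by hand (the VG name ADMIN is from this thread; with a clustered VG this assumes clvmd is already running, and normally pacemaker does the promote):

        # only once /dev/drbd0 is Primary on this node
        vgscan
        vgchange -ay ADMIN
        lvs ADMIN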
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED
William :-) So now your cluster it's OK? Il giorno 27 marzo 2012 00:33, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/26/12 5:31 PM, William Seligman wrote: On 3/26/12 5:17 PM, William Seligman wrote: On 3/26/12 4:28 PM, emmanuel segura wrote: and i suggest you to start clvmd at boot time chkconfig clvmd on I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get: Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED] ... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource. Wait a second... there's an ocf:heartbeat:LVM resource! Testing... Emannuel, you did it! For the sake of future searches, and possibly future documentation, let me start with my original description of the problem: I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time-out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87 The problem is that clvmd on the main node will hang if there's a substantive period of time during which the other node returns running cman but not clvmd. I never tracked down why this happens, but there's a practical solution: minimize any interval for which that would be true. To ensure this, take clvmd outside the resource manager's control: chkconfig cman on chkconfig clvmd on chkconfig pacemaker on On RHEL6.2, these services will be started in the above order; clvmd will start within a few seconds after cman. Here's my cluster.conf http://pastebin.com/GUr0CEgZ and the output of crm configure show http://pastebin.com/f9D4Ui5Z. The key lines from the latter are: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin primitive AdminLvm ocf:heartbeat:LVM \ params volgrpname=ADMIN \ op monitor interval=30 timeout=100 depth=0 primitive Gfs2 lsb:gfs2 group VolumeGroup AdminLvm Gfs2 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 \ clone-max=2 clone-node-max=1 \ notify=true interleave=true clone VolumeClone VolumeGroup \ meta interleave=true colocation Volume_With_Admin inf: VolumeClone AdminClone:Master order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start What I learned: If one is going to extend the example in Clusters From Scratch to include logical volumes, one must start clvmd at boot time, and include any volume groups in ocf:heartbeat:LVM resources that start before gfs2. Note the long timeout on the ocf:heartbeat:LVM resource. 
This is a good idea because, during the boot of the crashed node, there'll still be an interval of a few seconds when cman will be running but clvmd won't be. During my tests, the LVM monitor would fail if it checked during that interval with a timeout that was shorter than it took clvmd to start on the crashed node. This was annoying; all resources dependent on AdminLvm would be stopped until AdminLvm recovered (a few more seconds). Increasing the timeout avoids this. It also means that during any recovery procedure on the crashed node for which I turn off all the services, I have to minimize the interval between the start of cman and clvmd if I've turned off services at boot; e.g., service drbd start # ... and fix any split-brain problems or whatever service cman start; service clvmd start # put on one line service pacemaker start I thank everyone on this list who was patient with me as I pounded on this problem for two weeks! -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman
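A quick way to double-check that boot ordering, assuming the standard RHEL 6 init scripts:

        chkconfig cman on
        chkconfig clvmd on
        chkconfig pacemaker on
        chkconfig --list | egrep 'cman|clvmd|pacemaker'

The first three lines are exactly what the post above recommends; the last just confirms that all three services are enabled for the current runlevels.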
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
How do you configure clvmd? with cman or with pacemaker? Il giorno 23 marzo 2012 22:14, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry but i would to know if can show me your /etc/cluster/cluster.conf Here it is: http://pastebin.com/GUr0CEgZ Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: On 03/15/2012 11:50 PM, William Seligman wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd? In lvm2.conf, I see this sequence of lines pre-crash: device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see: evice/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes ... and then it hangs. 
Comparing the two, it looks like it can't close /dev/drbd0. If I look at /proc/drbd when I crash one node, I see this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s- ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing was not successful . So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread
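For reference, the drbd.conf fragment that the fencing advice above corresponds to would look roughly like this (DRBD 8.3 syntax; the resource name is the one used later in this thread, and the handler path is an assumption that depends on where the script from http://goo.gl/O4N8f was installed):

resource admin {
    disk {
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
    }
    ...
}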
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
I think it's better you use clvmd with cman I don't now why you use the lsb script of clvm On Redhat clvmd need of cman and you try to running with pacemaker, i not sure this is the problem but this type of configuration it's so strange I made it a virtual cluster with kvm and i not foud a problems Il giorno 24 marzo 2012 13:09, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/24/12 4:47 AM, emmanuel segura wrote: How do you configure clvmd? with cman or with pacemaker? Pacemaker. Here's the output of 'crm configure show': http://pastebin.com/426CdVwN Il giorno 23 marzo 2012 22:14, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry but i would to know if can show me your /etc/cluster/cluster.conf Here it is: http://pastebin.com/GUr0CEgZ Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing was not successful . So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality with stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504 At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return code matched those returned by other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status. But... After that, I'll look at another suggestion with lvm.conf: http://www.gossamer-threads.com/lists/linuxha/users/78796#78796 Then I'll try DRBD 8.4.1. Hopefully one of these is the source of the issue. Failure on all three counts. May I suggest you double check the permissions on your fence peer script? I suspect you may simply have forgotten the chmod +x . Test with drbdadm fence-peer minor-0 from the command line. I still haven't solved the problem, but this advice has gotten me further than before. First, Lars was correct: I did not have execute permissions set on my fence peer scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd still hung on the vgdisplay command if I crashed the peer node. I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried Lars' suggested command. I didn't save the response for this message (d'oh again!) but it said that the fence-peer script had failed. Hmm. The peer was definitely shutting down, so my fencing script is working. I went over it, comparing the return codes to those of the existing scripts, and made some changes. Here's my current script: http://pastebin.com/nUnYVcBK. 
Up until now my fence-peer scripts had either been Lon Hohberger's obliterate-peer.sh or Digimer's rhcs_fence. I decided to try stonith_admin-fence-peer.sh that Andreas Kurz recommended; unlike the first two scripts, which fence using fence_node, the latter script just calls stonith_admin. When I tried the stonith_admin-fence-peer.sh script, it worked: # drbdadm fence-peer minor-0 stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced peer orestes-corosync.nevis.columbia.edu. Power was cut on the peer, the remaining node stayed up. Then I brought up the peer with: stonith_admin -U orestes-corosync.nevis.columbia.edu BUT: When the restored peer came up and started to run cman, the clvmd hung on the main node again. After cycling through some more tests, I found that if I brought down the peer with drbdadm, then brought up with the peer with no HA services, then started drbd and then cman, the cluster remained intact. If I crashed the peer, the scheme in the previous paragraph didn't work. I bring up drbd, check that the disks are both UpToDate, then bring up cman. At that point the vgdisplay on the main node takes so long to run
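Two small checks that follow from the advice above (the script path is a placeholder for wherever the fence-peer handler actually lives):

        # the handler must be executable on both nodes
        chmod +x <path-to-stonith_admin-fence-peer.sh>
        # then exercise it through drbd itself, as suggested earlier in the thread
        drbdadm fence-peer minor-0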
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William Sorry but i would to know if can show me your /etc/cluster/cluster.conf Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: On 03/15/2012 11:50 PM, William Seligman wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd? In lvm2.conf, I see this sequence of lines pre-crash: device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see: evice/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes ... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. 
If I look at /proc/drbd when I crash one node, I see this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s- ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing was not successful . So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality with stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504 At the moment I'm pursuing the possibility that I'm returning the wrong
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William for the lvm hang you can use this in your /etc/lvm/lvm.conf ignore_suspended_devices = 1 because i seen in the lvm log, === and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0 === Il giorno 15 marzo 2012 23:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd? In lvm2.conf, I see this sequence of lines pre-crash: device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see: evice/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes ... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. 
If I look at /proc/drbd when I crash one node, I see this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s- ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 If I look at /proc/drbd if I bring down one node gracefully (crm node standby), I get this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r- ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 Could it be that drbd can't respond to certain requests from lvm if the state of the peer is DUnknown instead of Outdated? Il giorno 15 marzo 2012 20:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:55 PM, emmanuel segura wrote: I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether
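For reference, the lvm.conf change suggested above goes in the devices section; a minimal sketch:

devices {
    ...
    # skip devices whose I/O is suspended when scanning
    ignore_suspended_devices = 1
}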
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello Willian The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html. Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference? Il giorno 13 marzo 2012 23:29, William Seligmanseligman@nevis.** columbia.edu selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed? 
Il giorno 13 marzo 2012 22:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time-out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6
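A quick way to confirm that cluster locking is actually enabled, using only commands that appear later in this thread:

        lvmconf --enable-cluster
        lvm dumpconfig | egrep 'locking_type|fallback_to_local_locking'
        # expected with clvmd: locking_type=3 and fallback_to_local_locking=0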
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed: # cd /etc/lvm/ # cp lvm.conf lvm.conf.cluster # lvmconf --enable-cluster # diff lvm.conf lvm.conf.cluster # So the key lines have been there all along: locking_type = 3 fallback_to_local_locking = 0 Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html. Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference? Il giorno 13 marzo 2012 23:29, William Seligmanseligman@nevis.** columbia.edu selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. 
Going by these instructions: https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed? Il giorno 13 marzo 2012 22:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William Ho did you created your volume group give me the output of vgs command when the cluster it's up Il giorno 15 marzo 2012 17:06, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 11:50 AM, emmanuel segura wrote: yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Since you asked for confirmation, here it is: the output of 'clvmd -d' just now. http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now. Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed: # cd /etc/lvm/ # cp lvm.conf lvm.conf.cluster # lvmconf --enable-cluster # diff lvm.conf lvm.conf.cluster # So the key lines have been there all along: locking_type = 3 fallback_to_local_locking = 0 Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html . Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. 
I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference? Il giorno 13 marzo 2012 23:29, William Seligmanseligman@nevis.** columbia.edu selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Il giorno 15 marzo 2012 17:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:15 PM, emmanuel segura wrote: Ho did you created your volume group pvcreate /dev/drbd0 vgcreate -c y ADMIN /dev/drbd0 lvcreate -L 200G -n usr ADMIN # ... and so on # Nevis-HA is the cluster name I used in cluster.conf mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr # ... and so on give me the output of vgs command when the cluster it's up Here it is: Logging initialised at Thu Mar 15 12:40:39 2012 Set umask from 0022 to 0077 Finding all volume groups Finding volume group ROOT Finding volume group ADMIN VG#PV #LV #SN Attr VSize VFree ADMIN 1 5 0 wz--nc 2.61t 765.79g ROOT1 2 0 wz--n- 117.16g 0 Wiping internal VG cache I assume the c in the ADMIN attributes means that clustering is turned on? Il giorno 15 marzo 2012 17:06, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 11:50 AM, emmanuel segura wrote: yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Since you asked for confirmation, here it is: the output of 'clvmd -d' just now. http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now. Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed: # cd /etc/lvm/ # cp lvm.conf lvm.conf.cluster # lvmconf --enable-cluster # diff lvm.conf lvm.conf.cluster # So the key lines have been there all along: locking_type = 3 fallback_to_local_locking = 0 Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. 
Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html . Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Ok William we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Il giorno 15 marzo 2012 20:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:55 PM, emmanuel segura wrote: I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker. Il giorno 15 marzo 2012 17:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:15 PM, emmanuel segura wrote: Ho did you created your volume group pvcreate /dev/drbd0 vgcreate -c y ADMIN /dev/drbd0 lvcreate -L 200G -n usr ADMIN # ... and so on # Nevis-HA is the cluster name I used in cluster.conf mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr # ... and so on give me the output of vgs command when the cluster it's up Here it is: Logging initialised at Thu Mar 15 12:40:39 2012 Set umask from 0022 to 0077 Finding all volume groups Finding volume group ROOT Finding volume group ADMIN VG#PV #LV #SN Attr VSize VFree ADMIN 1 5 0 wz--nc 2.61t 765.79g ROOT1 2 0 wz--n- 117.16g 0 Wiping internal VG cache I assume the c in the ADMIN attributes means that clustering is turned on? Il giorno 15 marzo 2012 17:06, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 11:50 AM, emmanuel segura wrote: yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Since you asked for confirmation, here it is: the output of 'clvmd -d' just now. http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now. Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. 
Then I did as you suggested, but with a check to see if anything changed:
# cd /etc/lvm/
# cp lvm.conf lvm.conf.cluster
# lvmconf --enable-cluster
# diff lvm.conf lvm.conf.cluster
#
So the key lines have been there all along: locking_type = 3 and fallback_to_local_locking = 0.
On 14 March 2012 23:17, William Seligman selig...@nevis.columbia.edu wrote: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William, I didn't know you were using drbd, and I don't know what type of configuration you're using. But it would be better to start clvmd with clvmd -d, so that we can see what the problem is. For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html Here's one with the same problem on the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?
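For reference, Emmanuel's debug-logging suggestion at the top of this exchange corresponds to something like the following in /etc/lvm/lvm.conf (a sketch against a stock lvm.conf; the other defaults in the log section vary by distribution, and clvmd only rereads the file when it is restarted, e.g. by running clvmd -d again):

log {
    verbose = 0
    syslog = 1
    # write a copy of all messages to a file and raise the debug level
    file = "/var/log/lvm2.log"
    level = 7
}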
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on Il giorno 13 marzo 2012 23:29, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed? Il giorno 13 marzo 2012 22:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time-out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87 This may be a Linux-HA question after all! I ran a few more tests. Here's the output from a typical test of grep -E (dlm|gfs2}clvmd|fenc|syslogd) /var/log/messages http://pastebin.com/uqC6bc1b It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes. According to this page, if a fencing agent fails GFS2 will freeze to protect the data: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html As a test, I tried to fence my test node via standard means: stonith_admin -F orestes-corosync.nevis.columbia.edu These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv. Unfortunately, I still got the gfs2 freeze, so this is not the complete story. First things first. I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. 
Is there any reference to the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?
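For what it's worth, the contract the rest of the stack relies on is simply the agent's exit status: exit 0 only after the requested action has been verified (for an "off", only once the node is confirmed to be without power), and non-zero on any failure; an unfenced failure is exactly what leaves GFS2/DLM blocked. A minimal sketch of that contract follows; the helper functions are hypothetical placeholders, and a real agent on the cman stack also has to parse the option=value pairs that fenced passes on stdin:

#!/bin/bash
# Skeleton fence agent: only the exit-status contract is illustrated here.
action="$1"     # real agents read "option=value" lines from stdin instead
case "$action" in
  off|reboot)
    cut_power_via_ups   || exit 1   # hypothetical helper: drop the UPS outlet
    confirm_node_is_off || exit 1   # never report success before the node is verified dead
    exit 0
    ;;
  on)
    restore_power_via_ups && exit 0 || exit 1
    ;;
  status|monitor)
    ups_is_reachable && exit 0 || exit 1
    ;;
  *)
    exit 1
    ;;
esac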
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William, I didn't know you were using drbd, and I don't know what type of configuration you're using. But it would be better to start clvmd with clvmd -d, so that we can see what the problem is.
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Sorry William, but I think clvmd should be used with ocf:lvm2:clvmd. Example:
crm configure primitive clvmd ocf:lvm2:clvmd params daemon_timeout=30
clone cln_clvmd clvmd
And remember that clvmd depends on dlm, so you should do the same for the dlm.
On 13 March 2012 17:29, William Seligman selig...@nevis.columbia.edu wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: the monitor/status command includes vgdisplay, which hangs indefinitely, so the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32), cman-3.0.12.1, corosync-1.4.1, pacemaker-1.1.6, lvm2-2.02.87, lvm2-cluster-2.02.87. cluster.conf: http://pastebin.com/w5XNYyAX output of crm configure show: http://pastebin.com/atVkXjkn output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf /var/log/cluster/dlm_controld.log and /var/log/cluster/gfs_controld.log show nothing. When I shut down power to one node (orestes-tb), the output of grep -E (dlm|gfs2|clvmd) /var/log/messages is http://pastebin.com/vjpvCFeN.
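Spelling that suggestion out a little: the ocf:lvm2:clvmd agent ships with the SUSE lvm2 packaging and may not exist on a Red Hat 6 install (where the LSB init script is what's available), but where it is present the usual pattern is a DLM control daemon clone plus a clvmd clone, roughly like this in crm configure (names and timeouts are illustrative):

primitive dlm ocf:pacemaker:controld \
        op monitor interval="60s" timeout="60s"
primitive clvmd ocf:lvm2:clvmd \
        params daemon_timeout="30" \
        op monitor interval="60s" timeout="60s"
clone dlm-clone dlm meta interleave="true"
clone clvmd-clone clvmd meta interleave="true"
colocation clvmd-with-dlm inf: clvmd-clone dlm-clone
order dlm-before-clvmd inf: dlm-clone clvmd-clone

On a cman-based stack dlm_controld is started by cman itself, so the controld resource above would not apply there; the point is only that clvmd has to come up after the DLM, whichever component starts it.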
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William, so if you're using cman, why do you use lsb::clvmd? I think you are very confused.
Re: [Linux-HA] Apparent problem in pacemaker ordering
are you sure the exportfs agent can be use it with clone active/active? Il giorno 03 marzo 2012 00:12, William Seligman selig...@nevis.columbia.edu ha scritto: One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource): order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
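Two things may be worth checking here, without claiming either is the actual bug: what shows up immediately after startup is often just the initial probe (the monitor with interval 0), which Pacemaker runs on every resource regardless of ordering constraints, and clones in a dependency chain usually want interleave=true so that an instance only waits for the copy on its own node rather than for the whole clone. A sketch using the resource names from the pasted configurations:

clone ClvmdClone Clvmd meta interleave="true"
clone Gfs2Clone Gfs2 meta interleave="true"
clone ExportsClone ExportMail meta interleave="true"
order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone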
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
can you show me your /etc/cluster/cluster.conf? because i think your problem it's a fencing-loop Il giorno 01 marzo 2012 01:03, William Seligman selig...@nevis.columbia.edu ha scritto: On 2/28/12 7:26 PM, Lars Ellenberg wrote: On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote: off-topic Sigh. I wish that were the reason. The reason why I'm doing dual-primary is that I've a got a single-primary two-node cluster in production that simply doesn't work. One node runs resources; the other sits and twiddles its fingers; fine. But when primary goes down, secondary has trouble starting up all the resources; when we've actually had primary failures (UPS goes haywire, hard drive failure) the secondary often winds up in a state in which it runs none of the significant resources. With the dual-primary setup I have now, both machines are running the resources that typically cause problems in my single-primary configuration. If one box goes down, the other doesn't have to failover anything; it's already running them. (I needed IPaddr2 cloning to work properly for this to work, which is why I started that thread... and all the stupider of me for missing that crucial page in Clusters From Scratch.) My only remaining problem with the configuration is restoring a fenced node to the cluster. Hence my tests, and the reason why I started this thread. /off-topic Uhm, I do think that is exactly on topic. Rather fix your resources to be able to successfully take over, than add even more complexity. What resources would that be, and why are they not taking over? I can't tell you in detail, because the major snafu happened on a production system after a power outage a few months ago. My goal was to get the thing stable as quickly as possible. In the end, that turned out to be a non-HA configuration: One runs corosync+pacemaker+drbd, while the other just runs drbd. It works, in the sense that the users get their e-mail. If there's a power outage, I have to bring things up manually. So my only reference is the test-bench dual-primary setup I've got now, which is exhibiting the same kinds of problems even though the OS versions, software versions, and layout are different. This suggests that the problem lies in the way I'm setting up the configuration. The problems I have seem to be in the general category of the 'good guy' gets fenced when the 'bad guy' gets into trouble. Examples: - Assuming I start out with two crashed nodes. If I just start up DRBD and nothing else, the partitions sync quickly with no problems. - If the system starts with cman running, and I start drbd, it's likely that system who is _not_ Outdated will be fenced (rebooted). Same thing if cman+pacemaker is running. - Cloned ocf:heartbeat:exportfs resources are giving me problems as well (which is why I tried making changes to that resource script). Assume I start with one node running cman+pacemaker, and the other stopped. I turned on the stopped node. This will typically result in the running node being fenced, because it has it times out when stopping the exportfs resource. Falling back to DRBD 8.3.12 didn't change this behavior. My pacemaker configuration is long, so I'll excerpt what I think are the relevant pieces in the hope that it will be enough for someone to say You fool! This is covered in Pacemaker Explained page 56! When bringing up a stopped node, in order to restart AdminClone pacemaker wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. 
As I said, it's the failure to stop ExportMail on the running node that causes it to be fenced. primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op monitor interval=59s role=Slave \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 \ clone-max=2 clone-node-max=1 notify=true primitive Clvmd lsb:clvmd op monitor interval=30s clone ClvmdClone Clvmd colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start primitive Gfs2 lsb:gfs2 op monitor interval=30s clone Gfs2Clone Gfs2 colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone primitive ExportMail ocf:heartbeat:exportfs \ op start interval=0 timeout=40 \ op stop interval=0 timeout=45 \ params clientspec=mail directory=/mail fsid=30 clone ExportsClone ExportMail colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA|
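Since it is the timed-out stop of ExportMail that escalates to fencing (a failed stop is fenced by design when STONITH is enabled), one small experiment is a more generous stop timeout on the exportfs primitive; purely a knob to try, not a claim about the root cause:

primitive ExportMail ocf:heartbeat:exportfs \
        params clientspec="mail" directory="/mail" fsid="30" \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="120"

If the stop genuinely never completes (for example because it is blocked behind a frozen GFS2 mount), a longer timeout only delays the fence rather than preventing it.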
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
Try changing the fence_daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/>. Change your cluster config version and then reboot the cluster.
On 1 March 2012 12:28, William Seligman selig...@nevis.columbia.edu wrote: On 3/1/12 4:15 AM, emmanuel segura wrote: Can you show me your /etc/cluster/cluster.conf? Because I think your problem is a fencing loop. Here it is: /etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30"/>
  <rm disabled="1"/>
</cluster>
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
OK William, if this isn't the problem, then show me your pacemaker CIB XML: the output of crm configure show.
On 1 March 2012 18:10, William Seligman selig...@nevis.columbia.edu wrote: On 3/1/12 6:34 AM, emmanuel segura wrote: Try changing the fence_daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/>. Change your cluster config version and then reboot the cluster. This did not change the behavior of the cluster. In particular, I'm still dealing with this: if the system starts with cman running and I start drbd, it's likely that the system which is _not_ Outdated will be fenced (rebooted).
Re: [Linux-HA] Pacemaker - Resources not staying together
colocation altogether inf: apache mysql drbd_fs drbd_ms:Master 2012/2/10 Ryan Stepalavich rstepalav...@gmail.com I'm using Pacemaker to handle my cluster resources (on top of heartbeat). Everything works except the collocation parameter. I want all of my resources to stay on the same node at all times. Here's what my cluster looks like right now: --- Last updated: Fri Feb 10 16:52:10 2012 Stack: Heartbeat Current DC: svrmntr01 (715d1b92-3849-4dab-8d4a-a3b3a4f4efc3) - partition with quorum Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f 2 Nodes configured, unknown expected votes 4 Resources configured. Online: [ svrmntr01 svrmntr02 ] Master/Slave Set: drbd_ms [drbd] Masters: [ svrmntr01 ] Slaves: [ svrmntr02 ] drbd_fs (ocf::heartbeat:Filesystem):Started svrmntr01 apache (lsb:apache2): Started svrmntr02 (unmanaged) FAILED mysql (lsb:mysql) Started [ svrmntr01 svrmntr02 ] Failed actions: apache_monitor_0 (node=svrmntr02, call=4, rc=127, status=complete): unknown apache_stop_0 (node=svrmntr02, call=6, rc=127, status=complete): unknown - Here's my current configuration: - node $id=715d1b92-3849-4dab-8d4a-a3b3a4f4efc3 svrmntr01 node $id=af6fe9bc-b89d-4460-9b50-3039bbd9e144 svrmntr02 \ attributes standby=off primitive apache lsb:apache2 \ meta target-role=Started primitive drbd ocf:linbit:drbd \ params drbd_resource=lamp \ op monitor interval=60s primitive drbd_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/srv/data fstype=ext4 primitive mysql lsb:mysql ms drbd_ms drbd \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true location cli-standby-apache apache \ rule $id=cli-standby-rule-apache -inf: #uname eq svrmntr02 colocation altogether inf: drbd_fs drbd_ms:Master apache mysql order fs_after_drbd_then_lamp inf: drbd_ms:promote drbd_fs:start mysql:start apache:start property $id=cib-bootstrap-options \ dc-version=1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \ cluster-infrastructure=Heartbeat \ stonith-enabled=false Anybody have an idea as to what I'm doing wrong? Thanks! ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
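An alternative to a long colocation set that is often easier to reason about is a group: ordering inside the group is implicit, and a single colocation/order ties the whole group to the DRBD master. A sketch reusing the resource names above, not a drop-in replacement for the posted configuration; the leftover cli-standby-apache location constraint may also be worth removing, since it forbids apache from running on svrmntr02 at all:

group lamp drbd_fs mysql apache
colocation lamp_on_master inf: lamp drbd_ms:Master
order lamp_after_drbd inf: drbd_ms:promote lamp:start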
Re: [Linux-HA] Antw: Status about ocfs2.pcmk ?
Sorry But can we see your configuration? 2012/2/2 alain.mou...@bull.net Hi Don't remember in details, it was at the end of 2010 ... but : Why pacemaker is fencing a node ? because it was one of my simple HA test : for example, make the heartbeat no more working so that pacemaker fences the node, and it did it for sure, but on the remaining node, sometimes ocfs2 crashes the node, or the FS ocfs2 was passing in read-only ... I did simple HA tests, such as this one above, to check the robustness of the configuration and many times, it failed and make the HA cluster down. That's why I ask for status today. Alain De :Ulrich Windl ulrich.wi...@rz.uni-regensburg.de A : General Linux-HA mailing list linux-ha@lists.linux-ha.org Date : 02/02/2012 16:01 Objet : Re: [Linux-HA] Antw: Status about ocfs2.pcmk ? Envoyé par :linux-ha-boun...@lists.linux-ha.org alain.mou...@bull.net schrieb am 02.02.2012 um 16:05 in Nachricht of5716529b.0ca482e6-onc1257998.0050e240-c1257998.00518...@bull.net: Hi Thanks. OK but I also could mount the FS , the problems in 2010 on RHEL was that the configuration was not robust , meaning that during validation tests, there were often cases where both nodes were dead : one fenced by Pacemaker and the other killed itself by ocfs2, or problems on mount read-only FS after failover , etc. Hi! I'd start inspecting the logs: Why is pacemaker fencing a node? Why is the filesytem read-only? Regards, Ulrich Alain De :Ulrich Windl ulrich.wi...@rz.uni-regensburg.de A : linux-ha@lists.linux-ha.org Date : 02/02/2012 15:33 Objet : [Linux-HA] Antw: Status about ocfs2.pcmk ? Envoyé par :linux-ha-boun...@lists.linux-ha.org Hi! I have something running using OCFS on SLES11 SP1: ocf:pacemaker:controld ocf:ocfs2:o2cb At least I could mount the filesystem with it: /dev/drbd_r0 on /exports/ocfs/samba type ocfs2 (rw,_netdev,acl,cluster_stack=pcmk) Regards, Ulrich alain.mou...@bull.net schrieb am 02.02.2012 um 14:54 in Nachricht offed8db74.9970c9e4-onc1257998.004a64db-c1257998.004b1...@bull.net: Hi Just wonder if someone has succeded to configured a working HA configuration with Pacemaker/corosync and OCFS2 file systems, meaning using ocfs2.pcmk , on RHEL6 mainly (and eventually SLES11) ? (I tried at the end of 2010 but gave up after a few weeks because it was not working at all) Thanks if someone can give a status? Regards Alain Moullé ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
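For comparison, the controld/o2cb pairing Ulrich mentions is normally expressed as a cloned group, along the lines of the SLES 11 HA documentation, with the OCFS2 Filesystem resource then cloned on top of it. A sketch only; names and timeouts are illustrative:

primitive dlm ocf:pacemaker:controld \
        op monitor interval="60s" timeout="60s"
primitive o2cb ocf:ocfs2:o2cb \
        op monitor interval="60s" timeout="60s"
group base-group dlm o2cb
clone base-clone base-group meta interleave="true"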
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
William try to follow the suggestion of Arnold In my case it's different because we don't use drbd we are using SAN with ocfs2 But i think for drbd in dual primary you need the attribute master-max=2 2012/1/31 William Seligman selig...@nevis.columbia.edu On Tue, 31 Jan 2012 00:36:23 Arnold Krille wrote: On Tuesday 31 January 2012 00:12:52 emmanuel segura wrote: But if you wanna implement dual primary i think you don't nee promote for your drbd Try to use clone without master/slave At least when you use the linbit-ra, using it without a master-clone will give you one(!) slave only. When you use a normal clone with two clones, you will get two slaves. The RA only goes primary on promote, that is when its in master-state. = You need a master-clone of two clones with 1-2 masters to use drbd in the cluster. If I understand Emmanual's suggestion: The only way I know how to implement this is to create a simple clone group with lsb::drbd instead of Linbit's drbd resource, and put become-primary-on for both my nodes in drbd.conf. This might work in the short term, but I think it's risky in the long term. For example: Something goes wrong and node A stoniths node B. I bring node B back up, disabling cman+pacemaker before I do so, and want to re-sync node B's DRBD partition with A. If I'm stupid (occupational hazard), I won't remember to edit drbd.conf before I do this, node B will automatically try to become primary, and probably get stonith'ed again. Arnold: I thought that was what I was doing with these statements: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 That is, master-max=2 means to promote two instances to master. Did I get it wrong? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
William can you try like this primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master clone Adming AdminDrbd 2012/1/31 William Seligman selig...@nevis.columbia.edu On 1/31/12 3:47 PM, emmanuel segura wrote: William try to follow the suggestion of Arnold In my case it's different because we don't use drbd we are using SAN with ocfs2 But i think for drbd in dual primary you need the attribute master-max=2 I did, or thought I did. Have I missed something? Again, from crm configure show: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op monitor interval=59s role=Slave \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 Still no promotion to primary on either node. 2012/1/31 William Seligman selig...@nevis.columbia.edu On Tue, 31 Jan 2012 00:36:23 Arnold Krille wrote: On Tuesday 31 January 2012 00:12:52 emmanuel segura wrote: But if you wanna implement dual primary i think you don't nee promote for your drbd Try to use clone without master/slave At least when you use the linbit-ra, using it without a master-clone will give you one(!) slave only. When you use a normal clone with two clones, you will get two slaves. The RA only goes primary on promote, that is when its in master-state. = You need a master-clone of two clones with 1-2 masters to use drbd in the cluster. If I understand Emmanual's suggestion: The only way I know how to implement this is to create a simple clone group with lsb::drbd instead of Linbit's drbd resource, and put become-primary-on for both my nodes in drbd.conf. This might work in the short term, but I think it's risky in the long term. For example: Something goes wrong and node A stoniths node B. I bring node B back up, disabling cman+pacemaker before I do so, and want to re-sync node B's DRBD partition with A. If I'm stupid (occupational hazard), I won't remember to edit drbd.conf before I do this, node B will automatically try to become primary, and probably get stonith'ed again. Arnold: I thought that was what I was doing with these statements: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 That is, master-max=2 means to promote two instances to master. Did I get it wrong? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
Sorry William But if you wanna implement dual primary i think you don't nee promote for your drbd Try to use clone without master/slave 2012/1/30 William Seligman selig...@nevis.columbia.edu I'm trying to follow the directions for setting up a dual-primary DRBD setup with CMAN and Pacemaker. I'm stuck at an annoying spot: Pacemaker won't promote the DRBD resources to primary at either node. Here's the result of crm_mon: Last updated: Mon Jan 30 17:07:03 2012 Stack: cman Current DC: hypatia-tb - partition with quorum Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f 2 Nodes configured, unknown expected votes 2 Resources configured. Online: [ orestes-tb hypatia-tb ] Master/Slave Set: AdminClone [AdminDrbd] Slaves: [ hypatia-tb orestes-tb ] /etc/cluster/cluster.conf: cluster config_version=6 name=Nevis_HA logging debug=off/ cman expected_votes=1 two_node=1 / clusternodes clusternode name=hypatia-tb nodeid=1 fence method name=pcmk-redirect device name=pcmk port=hypatia-tb/ /method /fence /clusternode clusternode name=orestes-tb nodeid=2 fence method name=pcmk-redirect device name=pcmk port=orestes-tb/ /method /fence /clusternode /clusternodes fencedevices fencedevice name=pcmk agent=fence_pcmk/ /fencedevices !-- fence_daemon post_join_delay=30 / -- /cluster crm configure show: node hypatia-tb node orestes-tb primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 primitive Clvmd lsb:clvmd ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true clone ClvmdClone Clvmd colocation ClvmdWithAdmin inf: ClvmdClone AdminClone:Master order AdminBeforeClvmd inf: AdminClone:promote ClvmdClone:start property $id=cib-bootstrap-options \ dc-version=1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \ cluster-infrastructure=cman \ stonith-enabled=false DRBD looks OK: # cat /proc/drbd version: 8.4.0 (api:1/proto:86-100) GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by gardner@, 2012-01-25 19:10:28 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r- ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 I can manually do drbdadm primary admin on both nodes and get a Primary/Primary state. That still does not get Pacemaker to promote the resource. The only vaguely relevant lines in /var/log/messages seem to be: Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout) Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output: (AdminDrbd:0:start:stderr) Could not map uname= hypatia-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout) I've tried running with iptables both on and off, and the results are the same. Any clues? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
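One DRBD-side detail for the dual-primary case, independent of the promotion question: the resource itself has to allow two primaries, and once fencing works the crm-fence-peer handlers are usually wired in as well. A sketch of the relevant drbd.conf pieces for the resource named in this thread (everything here is illustrative rather than a copy of the poster's configuration); the "Could not map uname ... to a UUID" message may also point at a short-name versus FQDN mismatch between the node names in cluster.conf and what the RA hands to the CIB, which seems worth checking separately:

resource admin {
    net {
        protocol C;
        allow-two-primaries yes;       # required before both nodes may be Primary
    }
    disk {
        fencing resource-and-stonith;  # freeze I/O and call the handler when the peer is lost
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}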