Re: [Linux-HA] file system resource becomes inaccessible when any of the nodes goes down
When a node goes down it first shows up in the unclean state, as you can see in your logs: corosync forms a new configuration, a stonith reboot request is issued, and because you are using sbd the node only becomes offline once msgwait has expired. When msgwait expires pacemaker knows the node is dead and the stonith -> dlm -> ocfs2 chain can recover, so the filesystem becomes usable again. If you need to reduce msgwait, be careful about the other timeouts or you will trade one cluster problem for another.

2015-07-05 18:13 GMT+02:00 Muhammad Sharfuddin m.sharfud...@nds.com.pk:

SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70, openais-1.1.4-5.22.1.7). It is a dual-primary drbd cluster which mounts a file system resource on both cluster nodes simultaneously (file system type is ocfs2).

Whenever one of the nodes goes down, the file system (/sharedata) becomes inaccessible for exactly 35 seconds on the other (surviving/online) node, and then becomes available again on the online node. Please help me understand why the node which survives or remains online cannot access the file system resource (/sharedata) for 35 seconds, and how I can fix the cluster so that the file system remains accessible on the surviving node without any interruption/delay (about 35 seconds in my case).

By inaccessible I mean that running "ls -l /sharedata" and "df /sharedata" returns no output and does not give the prompt back on the online node for exactly 35 seconds once the other node becomes offline. E.g. node1 went offline somewhere around 01:37:15, and /sharedata was then inaccessible between 01:37:35 and 01:38:18 on the online node, i.e. node2.

/var/log/messages on node2, when node1 went offline:

Jul 5 01:37:26 node2 kernel: [ 675.255865] drbd r0: PingAck did not arrive in time.
Jul 5 01:37:26 node2 kernel: [ 675.255886] drbd r0: peer( Primary - Unknown ) conn( Connected - NetworkFailure ) pdsk( UpToDate - DUnknown )
Jul 5 01:37:26 node2 kernel: [ 675.256030] block drbd0: new current UUID C23D1458962AD18D:A8DD404C9F563391:6A5F4A26F64BAF0B:6A5E4A26F64BAF0B
Jul 5 01:37:26 node2 kernel: [ 675.256079] drbd r0: asender terminated
Jul 5 01:37:26 node2 kernel: [ 675.256081] drbd r0: Terminating drbd_a_r0
Jul 5 01:37:26 node2 kernel: [ 675.256306] drbd r0: Connection closed
Jul 5 01:37:26 node2 kernel: [ 675.256338] drbd r0: conn( NetworkFailure - Unconnected )
Jul 5 01:37:26 node2 kernel: [ 675.256339] drbd r0: receiver terminated
Jul 5 01:37:26 node2 kernel: [ 675.256340] drbd r0: Restarting receiver thread
Jul 5 01:37:26 node2 kernel: [ 675.256341] drbd r0: receiver (re)started
Jul 5 01:37:26 node2 kernel: [ 675.256344] drbd r0: conn( Unconnected - WFConnection )
Jul 5 01:37:29 node2 corosync[4040]: [TOTEM ] A processor failed, forming new configuration.
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] CLM CONFIGURATION CHANGE
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] New Configuration:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.132)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Left:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.131)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Joined:
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 216: memb=1, new=0, lost=1
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] info: pcmk_peer_update: memb: node2 739307908
Jul 5 01:37:35 node2 corosync[4040]: [pcmk ] info: pcmk_peer_update: lost: node1 739307907
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] CLM CONFIGURATION CHANGE
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] New Configuration:
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] r(0) ip(172.16.241.132)
Jul 5 01:37:35 node2 cluster-dlm[4344]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 ocfs2_controld[4473]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Left:
Jul 5 01:37:35 node2 crmd[4050]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 stonith-ng[4046]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 cib[4045]: notice: plugin_handle_membership: Membership 216: quorum lost
Jul 5 01:37:35 node2 cluster-dlm[4344]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 ocfs2_controld[4473]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was member)
Jul 5 01:37:35 node2 corosync[4040]: [CLM ] Members Joined:
Jul 5 01:37:35 node2 crmd[4050]: warning: match_down_event: No match for shutdown action on node1
Jul 5 01:37:35 node2 stonith-ng[4046]: notice: crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - state is now lost (was
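For reference, the roughly 35 seconds of blocking described above is approximately the corosync failure-detection time plus the sbd msgwait, after which DLM/OCFS2 recovery can proceed. A quick way to look at both values; the sbd device path below is a placeholder, and the config path is the usual one on SLES 11 HAE:

    # how long corosync waits before declaring the peer failed
    grep -i token /etc/corosync/corosync.conf
    # how long sbd then needs before the peer counts as fenced
    sbd -d /dev/<your-sbd-device> dump        # look at "Timeout (msgwait)"

If you shorten msgwait, keep pacemaker's stonith-timeout comfortably larger than it, otherwise fencing will be declared failed before the poison pill can take effect.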
Re: [Linux-HA] pacemaker/heartbeat LVM
Please use pastebin and show your whole logs.

2014-12-29 9:06 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

By the way, just to note that for normal testing (manual failover, rebooting the active node) the cluster works fine. I only encounter this error when I try to power off / shut off the active node.

On Mon, Dec 29, 2014 at 4:05 PM, Marlon Guao marlon.g...@gmail.com wrote:

Hi.

Dec 29 13:47:16 s1 LVM(vg1)[1601]: WARNING: LVM Volume cluvg1 is not available (stopped)
Dec 29 13:47:16 s1 crmd[1515]: notice: process_lrm_event: Operation vg1_monitor_0: not running (node=s1, call=23, rc=7, cib-update=40, confirmed=true)
Dec 29 13:47:16 s1 crmd[1515]: notice: te_rsc_command: Initiating action 9: monitor fs1_monitor_0 on s1 (local)
Dec 29 13:47:16 s1 crmd[1515]: notice: te_rsc_command: Initiating action 16: monitor vg1_monitor_0 on s2
Dec 29 13:47:16 s1 Filesystem(fs1)[1618]: WARNING: Couldn't find device [/dev/mapper/cluvg1-clulv1]. Expected /dev/??? to exist

From the LVM agent: it checks whether the volume is already available and raises the above error if not. But I don't see that it tries to activate the VG before raising the error. Perhaps it assumes that the VG is already activated, so I'm not sure who should be activating it (should it be LVM?).

if [ $rc -ne 0 ]; then
    ocf_log $loglevel "LVM Volume $1 is not available (stopped)"
    rc=$OCF_NOT_RUNNING
else
    case $(get_vg_mode) in
    1) # exclusive with tagging.
        # If vg is running, make sure the correct tag is present. Otherwise we
        # can not guarantee exclusive activation.
        if ! check_tags; then
            ocf_exit_reason "WARNING: $OCF_RESKEY_volgrpname is active without the cluster tag, \"$OUR_TAG\""

On Mon, Dec 29, 2014 at 3:36 PM, emmanuel segura emi2f...@gmail.com wrote:

logs?

2014-12-29 6:54 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Hi, just want to ask about the LVM resource agent on pacemaker/corosync. I set up a 2-node cluster (opensuse 13.2 -- my config below). The cluster works as expected for a manual failover (via crm resource move) and for automatic failover (by rebooting the active node, for instance). But if I just shut off the active node (it's a VM, so I can do a poweroff), the resources won't fail over to the passive node. When I investigated, it was due to the LVM resource not starting (specifically, the VG). I found out that the LVM resource won't try to activate the volume group on the passive node. Is this expected behaviour? What I really expect is that, in the event the active node is shut off (by a power outage for instance), all resources fail over automatically to the passive node and LVM re-activates the VG. Here's my config.
node 1: s1
node 2: s2
primitive cluIP IPaddr2 \
    params ip=192.168.13.200 cidr_netmask=32 \
    op monitor interval=30s
primitive clvm ocf:lvm2:clvmd \
    params daemon_timeout=30 \
    op monitor timeout=90 interval=30
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60s timeout=90s on-fail=ignore \
    op start interval=0 timeout=90
primitive fs1 Filesystem \
    params device=/dev/mapper/cluvg1-clulv1 directory=/data fstype=btrfs
primitive mariadb mysql \
    params config=/etc/my.cnf
primitive sbd stonith:external/sbd \
    op monitor interval=15s timeout=60s
primitive vg1 LVM \
    params volgrpname=cluvg1 exclusive=yes \
    op start timeout=10s interval=0 \
    op stop interval=0 timeout=10 \
    op monitor interval=10 timeout=30 on-fail=restart depth=0
group base-group dlm clvm
group rgroup cluIP vg1 fs1 mariadb \
    meta target-role=Started
clone base-clone base-group \
    meta interleave=true target-role=Started
property cib-bootstrap-options: \
    dc-version=1.1.12-1.1.12.git20140904.266d5c2 \
    cluster-infrastructure=corosync \
    no-quorum-policy=ignore \
    last-lrm-refresh=1419514875 \
    cluster-name=xxx \
    stonith-enabled=true
rsc_defaults rsc-options: \
    resource-stickiness=100
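When testing this by hand on the surviving node, it can help to run the same kind of checks the LVM agent performs; a minimal sketch, using the cluvg1 VG name from the config above:

    # is the VG visible and active on this node?
    vgs -o vg_name,vg_attr cluvg1
    lvs cluvg1
    # try to activate it exclusively, the way exclusive=yes expects
    vgchange -a ey cluvg1

If vgchange hangs here after the peer was powered off, the problem is below LVM (clvmd/DLM waiting for fencing), not in the resource agent itself.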
Re: [Linux-HA] pacemaker/heartbeat LVM
Sorry, but your paste is empty.

2014-12-29 10:19 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

hi, uploaded it here. http://susepaste.org/45413433 thanks.

On Mon, Dec 29, 2014 at 5:09 PM, Marlon Guao marlon.g...@gmail.com wrote:

Ok, I attached the log file of one of the nodes.

On Mon, Dec 29, 2014 at 4:42 PM, emmanuel segura emi2f...@gmail.com wrote:

Please use pastebin and show your whole logs.
Re: [Linux-HA] pacemaker/heartbeat LVM
Hi, you have a problem in the cluster:

stonithd: error: crm_abort: crm_glib_handler: Forked child 6186 to record non-fatal assert at logging.c:73

Please post your cluster version (packages); maybe someone can tell you whether this is a known bug or a new one.

2014-12-29 10:29 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

ok, sorry for that.. please use this instead. http://pastebin.centos.org/14771/ thanks.

On Mon, Dec 29, 2014 at 5:25 PM, emmanuel segura emi2f...@gmail.com wrote:

Sorry, but your paste is empty.
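A quick way to collect the package versions asked for above (the package names are the usual ones on openSUSE; adjust the list for your distribution):

    rpm -qa | egrep 'pacemaker|corosync|resource-agents|libqb|sbd|cluster-glue'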
Re: [Linux-HA] pacemaker/heartbeat LVM
Dec 27 15:38:00 s1 cib[1514]: error: crm_xml_err: XML Error: Permission denied  Permission denied  I/O warning : failed to load external entity /var/lib/pacemaker/cib/cib.xml
Dec 27 15:38:00 s1 cib[1514]: error: write_cib_contents: Cannot link /var/lib/pacemaker/cib/cib.xml to /var/lib/pacemaker/cib/cib-0.raw: Operation not permitted (1)

2014-12-29 10:33 GMT+01:00 emmanuel segura emi2f...@gmail.com:

Hi, you have a problem in the cluster:

stonithd: error: crm_abort: crm_glib_handler: Forked child 6186 to record non-fatal assert at logging.c:73

Please post your cluster version (packages); maybe someone can tell you whether this is a known bug or a new one.
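The "Operation not permitted" error while writing cib.xml usually points at wrong ownership or permissions under /var/lib/pacemaker, since the cib daemon runs as the hacluster user. A hedged sketch of checking and fixing that; the hacluster:haclient owner and the path are the pacemaker defaults, verify them on your installation:

    ls -l /var/lib/pacemaker/cib/
    # expected owner is hacluster:haclient; if not:
    chown -R hacluster:haclient /var/lib/pacemaker/cib
    chmod 750 /var/lib/pacemaker/cib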
Re: [Linux-HA] pacemaker/heartbeat LVM
DLM isn't the problem; I think it is your fencing. When you powered off the active node, did the dead node remain in the unclean state? Can you show me your sbd timeouts?

sbd -d /dev/path_of_your_device dump

Thanks

2014-12-29 11:02 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Hi, ah yeah.. I powered off the active node and tried pvscan on the passive, and indeed it didn't work --- it doesn't return to the shell. So the problem is in DLM?

On Mon, Dec 29, 2014 at 5:51 PM, emmanuel segura emi2f...@gmail.com wrote:

Power off the active node and after one second try to use an lvm command, for example pvscan. If this command doesn't respond it is because dlm relies on cluster fencing; if cluster fencing doesn't work, dlm stays blocked.

2014-12-29 10:43 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Perhaps we need to focus on this message. As mentioned, the cluster works fine under normal circumstances. My only concern is that the LVM resource agent doesn't try to re-activate the VG on the passive node when the active node goes down ungracefully (powered off). Hence it could not mount the filesystems, etc.

Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation sbd_monitor_0: not running (node=s1, call=5, rc=7, cib-update=35, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 13: monitor dlm:0_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 5: monitor dlm:1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation dlm_monitor_0: not running (node=s1, call=10, rc=7, cib-update=36, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 14: monitor clvm:0_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 6: monitor clvm:1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation clvm_monitor_0: not running (node=s1, call=15, rc=7, cib-update=37, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 15: monitor cluIP_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 7: monitor cluIP_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation cluIP_monitor_0: not running (node=s1, call=19, rc=7, cib-update=38, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 16: monitor vg1_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 8: monitor vg1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 LVM(vg1)[1583]: WARNING: LVM Volume cluvg1 is not available (stopped)
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation vg1_monitor_0: not running (node=s1, call=23, rc=7, cib-update=39, confirmed=true)
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 17: monitor fs1_monitor_0 on s2
Dec 29 17:12:26 s1 crmd[1495]: notice: te_rsc_command: Initiating action 9: monitor fs1_monitor_0 on s1 (local)
Dec 29 17:12:26 s1 Filesystem(fs1)[1600]: WARNING: Couldn't find device [/dev/mapper/cluvg1-clulv1]. Expected /dev/??? to exist
Dec 29 17:12:26 s1 crmd[1495]: notice: process_lrm_event: Operation fs1_monitor_0: not running (node=s1, call=27, rc=7, cib-update=40, confirmed=true)

On Mon, Dec 29, 2014 at 5:38 PM, emmanuel segura emi2f...@gmail.com wrote:

Dec 27 15:38:00 s1 cib[1514]: error: crm_xml_err: XML Error: Permission denied  Permission denied  I/O warning : failed to load external entity /var/lib/pacemaker/cib/cib.xml
Dec 27 15:38:00 s1 cib[1514]: error: write_cib_contents: Cannot link /var/lib/pacemaker/cib/cib.xml to /var/lib/pacemaker/cib/cib-0.raw: Operation not permitted (1)
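The test described above, spelled out as a small procedure to run on the surviving node right after powering off its peer (resource names come from this cluster; the exact dlm_tool output format depends on your version):

    pvscan            # hangs while DLM is waiting for fencing
    crm_mon -1        # does the dead node stay UNCLEAN (offline)?
    dlm_tool ls       # lockspace status; look for pending fencing/recovery

If pvscan only returns after the dead node has actually been fenced (or never returns when fencing fails), the blockage is in the fencing path, not in the LVM agent.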
Re: [Linux-HA] pacemaker/heartbeat LVM
https://bugzilla.redhat.com/show_bug.cgi?id=1127289#c4
https://bugzilla.redhat.com/show_bug.cgi?id=1127289

2014-12-29 11:57 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

here it is..

==Dumping header on disk /dev/mapper/sbd
Header version : 2.1
UUID : 36074673-f48e-4da2-b4ee-385e83e6abcc
Number of slots: 255
Sector size: 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 10

On Mon, Dec 29, 2014 at 6:42 PM, emmanuel segura emi2f...@gmail.com wrote:

DLM isn't the problem; I think it is your fencing. When you powered off the active node, did the dead node remain in the unclean state? Can you show me your sbd timeouts?

sbd -d /dev/path_of_your_device dump

Thanks
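Given the msgwait of 10 seconds shown in that dump, the cluster's stonith-timeout has to be noticeably larger than msgwait, or sbd fencing will be reported as failed before the poison pill can act. A sketch only; the 40s value is just "msgwait plus a generous margin", not a recommendation specific to this setup:

    sbd -d /dev/mapper/sbd dump                   # Timeout (msgwait) : 10
    crm configure property stonith-timeout=40s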
Re: [Linux-HA] pacemaker/heartbeat LVM
You have no-quorum-policy=ignore. In the thread you posted:

Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 datastores wait for fencing
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 clvmd wait for fencing
Nov 24 09:55:10 nebula3 dlm_controld[6263]: 747 fence status 1084811078 receive -125 from 1084811079 walltime 1416819310 local 747

The dependency chain is {lvm}-{clvmd}-{dlm}-{fencing} = if fencing isn't working :) your cluster will be broken.

2014-12-29 15:46 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Looks like it's similar to this as well: http://comments.gmane.org/gmane.linux.highavailability.pacemaker/22398
But could it be that clvm is not activating the vg on the passive node because it's waiting for quorum? I'm seeing this in the log as well:

Dec 29 21:18:09 s2 dlm_controld[1776]: 8544 fence work wait for quorum
Dec 29 21:18:12 s2 dlm_controld[1776]: 8547 clvmd wait for quorum

On Mon, Dec 29, 2014 at 9:24 PM, Marlon Guao marlon.g...@gmail.com wrote:

interesting, i'm using the newer pacemaker version.. pacemaker-1.1.12.git20140904.266d5c2-1.5.x86_64

On Mon, Dec 29, 2014 at 8:11 PM, emmanuel segura emi2f...@gmail.com wrote:

https://bugzilla.redhat.com/show_bug.cgi?id=1127289#c4
https://bugzilla.redhat.com/show_bug.cgi?id=1127289
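Since this is a two-node corosync 2.x cluster (openSUSE 13.2), the usual way to stop DLM and clvmd from waiting for quorum after losing the peer is to declare the cluster two-node in corosync itself, rather than relying only on no-quorum-policy=ignore in pacemaker. A sketch of the corosync.conf quorum section, assuming votequorum is in use (two_node implicitly enables wait_for_all):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }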
Re: [Linux-HA] pacemaker/heartbeat LVM
logs?

2014-12-29 6:54 GMT+01:00 Marlon Guao marlon.g...@gmail.com:

Hi, just want to ask about the LVM resource agent on pacemaker/corosync. I set up a 2-node cluster (opensuse 13.2 -- my config below). The cluster works as expected for a manual failover (via crm resource move) and for automatic failover (by rebooting the active node, for instance). But if I just shut off the active node (it's a VM, so I can do a poweroff), the resources won't fail over to the passive node. When I investigated, it was due to the LVM resource not starting (specifically, the VG). I found out that the LVM resource won't try to activate the volume group on the passive node. Is this expected behaviour? What I really expect is that, in the event the active node is shut off (by a power outage for instance), all resources fail over automatically to the passive node and LVM re-activates the VG. Here's my config.

node 1: s1
node 2: s2
primitive cluIP IPaddr2 \
    params ip=192.168.13.200 cidr_netmask=32 \
    op monitor interval=30s
primitive clvm ocf:lvm2:clvmd \
    params daemon_timeout=30 \
    op monitor timeout=90 interval=30
primitive dlm ocf:pacemaker:controld \
    op monitor interval=60s timeout=90s on-fail=ignore \
    op start interval=0 timeout=90
primitive fs1 Filesystem \
    params device=/dev/mapper/cluvg1-clulv1 directory=/data fstype=btrfs
primitive mariadb mysql \
    params config=/etc/my.cnf
primitive sbd stonith:external/sbd \
    op monitor interval=15s timeout=60s
primitive vg1 LVM \
    params volgrpname=cluvg1 exclusive=yes \
    op start timeout=10s interval=0 \
    op stop interval=0 timeout=10 \
    op monitor interval=10 timeout=30 on-fail=restart depth=0
group base-group dlm clvm
group rgroup cluIP vg1 fs1 mariadb \
    meta target-role=Started
clone base-clone base-group \
    meta interleave=true target-role=Started
property cib-bootstrap-options: \
    dc-version=1.1.12-1.1.12.git20140904.266d5c2 \
    cluster-infrastructure=corosync \
    no-quorum-policy=ignore \
    last-lrm-refresh=1419514875 \
    cluster-name=xxx \
    stonith-enabled=true
rsc_defaults rsc-options: \
    resource-stickiness=100
Re: [Linux-HA] Oracle OCF Script throws SP2-0640: Not connected
I'm using resource-agents-3.9.2-0.25.5 on SUSE 11 SP2 and I don't have any ". /usr/lib/ocf/lib/heartbeat/ora-common.sh" in my agent. Maybe you need to create a new database user; your trace shows the agent connecting as OCFMON:

++ local 'conn_s=connect OCFMON/OCFMON'
++ shift 1
++ local func
++ echo 'connect OCFMON/OCFMON'

2014-08-01 18:40 GMT+02:00 Wendt Christian christian.we...@bosch-si.com:

Hello *, I did a lot of research but I'm not able to figure out why our Oracle resource started failing on Wednesday. Starting Oracle fails with the message:

/usr/lib/ocf/resource.d/heartbeat/oracle start
INFO: orcSNBGW instance state is not OPEN (dbstat output: SP2-0640: Not connected)
ERROR: oracle instance orcSNBGW not started

showdbstat gives:

/usr/lib/ocf/resource.d/heartbeat/oracle showdbstat
Full output: SP2-0640: Not connected
Stripped output: OPEN

So the first method showdbstat uses to monitor the DB fails, but the second one succeeds. It is no longer possible to start Oracle within the pacemaker cluster; every time we start it, it fails. I've attached the bash output from starting Oracle with the OCF script. Database and OS are fine; nothing changed in the last days. Do you have any ideas? Thank you in advance.

Mit freundlichen Grüßen / Best regards
Christian Wendt
Bosch Software Innovations GmbH
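The trace above shows the agent connecting as OCFMON/OCFMON, which is the monitoring user newer versions of the oracle resource agent expect. A hedged sketch of creating it by hand as SYSDBA; the exact privileges your agent version needs may differ, so treat the grants below as illustrative only:

    sqlplus / as sysdba <<'EOF'
    create user OCFMON identified by OCFMON;
    grant create session to OCFMON;
    grant select on v_$instance to OCFMON;
    EOF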
Re: [Linux-HA] Antw: Managed Failovers w/ NFS HA Cluster
But the NFS failover works now?

2014-07-22 2:10 GMT+02:00 Charles Taylor chas...@ufl.edu:

On Jul 21, 2014, at 10:40 AM, Charles Taylor wrote:

As I write this, I'm thinking that perhaps the way to achieve this is to change the order of the services so that the VIP is started last and stopped first when stopping/starting the resource group. That should make it appear to the client that the server just went away, as would happen in a failure scenario. Then the client should not know that the file system has been unexported, since it can't talk to the server. Perhaps I just made a rookie mistake in the ordering of the services within the resource group. I'll try that and report back.

Yep, this was my mistake. The IPaddr2 primitive needs to follow the exportfs primitives in my resource group, so they are now arranged as:

Resource Group: grp_b3v0
    vg_b3v0   (ocf::heartbeat:LVM)         Started
    fs_b3v0   (ocf::heartbeat:Filesystem)  Started
    ex_b3v0_1 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_2 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_3 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_4 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_5 (ocf::heartbeat:exportfs)    Started
    ex_b3v0_6 (ocf::heartbeat:exportfs)    Started
    ip_vbio3  (ocf::heartbeat:IPaddr2)     Started

Thanks to those who responded,
Charlie Taylor
UF Research Computing
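For reference, the corrected ordering expressed as a crm shell group definition; the resource names are the ones from the status output above, and this is only a sketch of "exports before VIP":

    group grp_b3v0 vg_b3v0 fs_b3v0 \
        ex_b3v0_1 ex_b3v0_2 ex_b3v0_3 ex_b3v0_4 ex_b3v0_5 ex_b3v0_6 \
        ip_vbio3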
Re: [Linux-HA] DRBD on CentOS7
depmod -a
modprobe drbd
?

2014-07-18 13:05 GMT+02:00 willi.feh...@t-online.de willi.feh...@t-online.de:

Hello, I'm trying to use DRBD on CentOS 7. It looks like Red Hat hasn't compiled DRBD into the kernel, so I downloaded the source rpm from Fedora 19 and built my own rpms:

[root@centos7 ~]# rpm -qa | grep drbd
drbd-utils-8.4.3-2.el7.centos.x86_64
drbd-8.4.3-2.el7.centos.x86_64
drbd-udev-8.4.3-2.el7.centos.x86_64

But I cannot load the drbd kernel module:

[root@centos7 ~]# modprobe drbd
modprobe: FATAL: Module drbd not found.

Regards - Willi
Re: [Linux-HA] DRBD on CentOS7
rpm -ql drbd-8.4.3-2.el7.centos.x86_64

2014-07-18 16:31 GMT+02:00 Alessandro Baggi alessandro.ba...@gmail.com:

I'm new to CentOS, and even newer to CentOS 7. Maybe you have not compiled the drbd kernel module. Reading the drbd site, you must prepare the kernel source tree and pass --with-km to also build the kernel module. I'm running CentOS 6.5, and for the drbd suite I use elrepo (it supports el7 too) and it works very well. http://elrepo.org/tiki/tiki-index.php

depmod -a
modprobe drbd
?
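A sketch of the elrepo route mentioned above; the release RPM URL and package names are the ones elrepo published for el7, so verify the current versions before installing:

    rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
    rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
    yum install drbd84-utils kmod-drbd84
    depmod -a && modprobe drbd && lsmod | grep drbd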
Re: [Linux-HA] troublesome DRBD resources on CentOS 6.5
no logs!

2014-06-05 14:56 GMT+02:00 Bart Coninckx bart.conin...@telenet.be:

Hi all, I have some DRBD resources on CentOS 6.5 which refuse to start. The message I get in Hawk and in /var/log/messages is:

Failed op: node=storage3, resource=p_drbd_ws021, call-id=73, operation=monitor, rc-code=6

I am able to start the DRBD resources manually. I figured out that code 6 means configuration error, but I don't see where. DRBD and its resource agent are installed from source (drbd-8.4.4). This is the relevant cluster configuration, which I took from the Clusters from Scratch document:

primitive p_drbd_ws021 ocf:linbit:drbd \
    params drbd_resource=ws021 drbdconf=/etc/drbd.conf \
    op monitor interval=60 timeout=20
ms ms_drbd_ws021 p_drbd_ws021 \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started

Any tips or hints are most welcome, because I have been looking at this for two days with no progress. Thanks! BC
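rc-code=6 is the OCF "not configured" return code, i.e. the agent itself rejected its parameters. One way to see why, outside pacemaker, is to drive the agent by hand with ocf-tester from resource-agents; the resource name and parameters are taken from the configuration above:

    ocf-tester -n p_drbd_ws021 \
        -o drbd_resource=ws021 -o drbdconf=/etc/drbd.conf \
        /usr/lib/ocf/resource.d/linbit/drbd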
Re: [Linux-HA] Packemaker resources for Galera cluster
If you have the ClusterIP resource in g_mysql, I think you don't need "order order_mysql_before_ip Mandatory: p_mysql ClusterIP", because a group is ordered by default. If you want mysql running on all boxes, use a clone resource and a colocation constraint to put the IP on a box with an active mysql instance.

2014-06-05 6:48 GMT+02:00 Razvan Oncioiu ronci...@gmail.com:

Hello, I can't seem to find a proper way of setting up resources in pacemaker to manage my Galera cluster. I want a VIP that will fail over between 5 boxes (this works), but I would also like to tie this to a resource that monitors mysql as well: if a mysql instance goes down, the VIP should move to another box that has mysql actually running. But I do not want pacemaker to start or stop the mysql service. Here is my current configuration:

node galera01
node galera02
node galera03
node galera04
node galera05
primitive ClusterIP IPaddr2 \
    params ip=10.10.10.178 cidr_netmask=24 \
    meta is-managed=true \
    op monitor interval=5s
primitive p_mysql mysql \
    params pid=/var/lib/mysql/mysqld.pid test_user=root test_passwd=goingforbroke \
    meta is-managed=false \
    op monitor interval=5s OCF_CHECK_LEVEL=10 \
    op start interval=0 timeout=60s \
    op stop interval=0 timeout=60s on-fail=standby
group g_mysql p_mysql ClusterIP
order order_mysql_before_ip Mandatory: p_mysql ClusterIP
property cib-bootstrap-options: \
    dc-version=1.1.10-14.el6_5.3-368c726 \
    cluster-infrastructure="classic openais (with plugin)" \
    stonith-enabled=false \
    no-quorum-policy=ignore \
    expected-quorum-votes=5 \
    last-lrm-refresh=1401942846
rsc_defaults rsc-options: \
    resource-stickiness=100
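A sketch of the clone-plus-colocation layout suggested above, in crm shell; p_mysql and ClusterIP are the original resources, while the clone and constraint names are made up here:

    clone cl_mysql p_mysql \
        meta interleave=true
    colocation col_ip_with_mysql inf: ClusterIP cl_mysql
    order ord_mysql_before_ip Mandatory: cl_mysql ClusterIP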
Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK
The first thing: you are using no_path_retry the wrong way in your multipath configuration. Try reading this: http://www.novell.com/documentation/oes2/clus_admin_lx/data/bl9ykz6.html

2014-04-22 20:41 GMT+02:00 Tom Parker tpar...@cbnco.com:

I have attached the config files to this e-mail. The sbd dump is below.

[LIVE] qaxen1:~ # sbd -d /dev/mapper/qa-xen-sbd dump
==Dumping header on disk /dev/mapper/qa-xen-sbd
Header version : 2.1
UUID : ae835596-3d26-4681-ba40-206b4d51149b
Number of slots: 255
Sector size: 512
Timeout (watchdog) : 45
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 90
==Header on disk /dev/mapper/qa-xen-sbd is dumped

On 22/04/14 02:30 PM, emmanuel segura wrote:

You are missing the cluster configuration, the sbd configuration and the multipath config.

2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com:

Has anyone seen this? Do you know what might be causing the flapping?
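What the linked document boils down to for an sbd device: I/O to it must fail fast instead of being queued when paths drop, otherwise sbd's disk servant stalls and the health check flaps exactly as in the log below. A hedged multipath.conf sketch; the wwid is a placeholder and your array vendor's recommended settings take precedence:

    multipaths {
        multipath {
            wwid          <wwid-of-qa-xen-sbd>
            alias         qa-xen-sbd
            no_path_retry fail
            features      "0"
        }
    }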
Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK
you are missingo cluster configuration and sbd configuration and multipath config 2014-04-22 20:21 GMT+02:00 Tom Parker tpar...@cbnco.com: Has anyone seen this? Do you know what might be causing the flapping? Apr 21 22:03:03 qaxen6 sbd: [12962]: info: Watchdog enabled. Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Servant starting for device /dev/mapper/qa-xen-sbd Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Monitoring Pacemaker health Apr 21 22:03:03 qaxen6 sbd: [12973]: info: Device /dev/mapper/qa-xen-sbd uuid: ae835596-3d26-4681-ba40-206b4d51149b Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Legacy plug-in detected, AIS quorum check enabled Apr 21 22:03:03 qaxen6 sbd: [12974]: info: Waiting to sign in with cluster ... Apr 21 22:03:04 qaxen6 sbd: [12971]: notice: Using watchdog device: /dev/watchdog Apr 21 22:03:04 qaxen6 sbd: [12971]: info: Set watchdog timeout to 45 seconds. Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with cluster ... Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right now. Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:03:09 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 21 22:15:01 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:16:37 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 21 22:25:08 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 21 22:26:44 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:26:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 21 22:39:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 21 22:39:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 21 22:42:44 qaxen6 sbd: [12974]: info: Node state: online Apr 21 22:42:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 01:36:24 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 01:36:24 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 01:36:34 qaxen6 sbd: [12974]: info: Node state: online Apr 22 01:36:34 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 06:53:15 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 06:53:15 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 06:54:03 qaxen6 sbd: [12974]: info: Node state: online Apr 22 06:54:03 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 09:57:21 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 09:57:21 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 09:58:12 qaxen6 sbd: [12974]: info: Node state: online Apr 22 09:58:12 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 10:59:49 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 10:59:49 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 11:00:41 qaxen6 sbd: [12974]: info: Node state: online Apr 22 11:00:41 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 11:50:55 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! 
Apr 22 11:50:55 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 11:51:06 qaxen6 sbd: [12974]: info: Node state: online Apr 22 11:51:06 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:09:12 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:09:12 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:09:35 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:09:35 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:31:35 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:31:35 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:31:44 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:31:44 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:32:52 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:32:52 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:33:01 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:33:01 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 13:44:39 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 13:44:39 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 13:44:47 qaxen6 sbd: [12974]: info: Node state: online Apr 22 13:44:47 qaxen6 sbd: [12971]: info: Pacemaker health check: OK Apr 22 14:07:42 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated! Apr 22 14:07:42 qaxen6 sbd: [12971]: WARN: Pacemaker health check: UNHEALTHY Apr 22 14:07:51 qaxen6 sbd: [12974]: info: Node state: online Apr 22 14:07:51 qaxen6 sbd: [12971]: info: Pacemaker
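To gather what the reply asks for, and to compare the on-disk SBD timeouts against the corosync token timeout, commands along these lines could be run on qaxen6 (the device path comes from the log above; the file locations assume a SLES-style install):

    sbd -d /dev/mapper/qa-xen-sbd dump            # on-disk watchdog/msgwait timeouts
    cat /etc/sysconfig/sbd                        # how the sbd daemon is started
    grep -A20 totem /etc/corosync/corosync.conf   # token/consensus settings
    multipath -ll qa-xen-sbd                      # path state of the SBD LUN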
Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA
The idea behind using exclusive volume activation mode with clvmd was (I think) to have the VG known cluster-wide while its LVs are opened on only one node, with the LVM metadata replicated to all cluster nodes whenever you make a change such as an LVM resize. I have a Red Hat cluster running clvmd with a VG activated in exclusive mode: if you add a PV to your volume group, every cluster node knows about the new PV in the VG, but you still cannot activate the VG if it is active on another node. I think clvmd is needed just to replicate the LVM metadata. 2013/12/2 Ulrich Windl ulrich.wi...@rz.uni-regensburg.de Lars Marowsky-Bree l...@suse.com wrote on 29.11.2013 at 13:48 in message 20131129124833.gf22...@suse.de: On 2013-11-29T13:46:17, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: I just did s/true/false/... Was that a clustered volume? Clustered exclusive=true ?? No! Then it can't work. Exclusive activation only works for clustered volume groups, since it uses the DLM to protect against the VG being activated more than once in the cluster. Hi! Try it with resource-agents-3.9.4-0.26.84: it works; with resource-agents-3.9.5-0.6.26.11 it doesn't work ;-) You could argue that it never should have worked. Anyway: If you want to activate a VG on exactly one node you should not need cLVM; only if you mean to activate the VG on multiple nodes (as for a cluster file system)... Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
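As a rough illustration of what exclusive activation of a clustered VG involves (the VG name vg_shared is invented), the clustered flag is set once and activation is then arbitrated through clvmd and the DLM:

    vgchange -c y vg_shared     # mark the VG as clustered (clvmd must be running)
    vgchange -a ey vg_shared    # exclusive activation; refused if active elsewhere

    primitive p_lvm ocf:heartbeat:LVM \
        params volgrp="vg_shared" exclusive="true" \
        op monitor interval="60s"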
Re: [Linux-HA] trouble installing Heartbeat
maybe you are missing the uuid library 2013/12/1 John Williams john.1...@yahoo.com I'm trying to install heartbeat and I'm getting the following error with the cluster glue components during the make part of the build: /bin/sh ../../libtool --tag=CC --tag=CC --mode=link gcc -std=gnu99 -g -O2 -ggdb3 -O0 -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-qual -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wmissing-format-attribute -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror -o ipctest ipctest.o libplumb.la ../../replace/libreplace.la ../../lib/pils/ libpils.la -lbz2 -lxml2 -lc -lrt -ldl -lglib-2.0 -lltdl libtool: link: gcc -std=gnu99 -g -O2 -ggdb3 -O0 -fgnu89-inline -fstack-protector-all -Wall -Waggregate-return -Wbad-function-cast -Wcast-qual -Wcast-align -Wdeclaration-after-statement -Wendif-labels -Wfloat-equal -Wformat=2 -Wformat-security -Wformat-nonliteral -Winline -Wmissing-prototypes -Wmissing-declarations -Wmissing-format-attribute -Wnested-externs -Wno-long-long -Wno-strict-aliasing -Wpointer-arith -Wstrict-prototypes -Wwrite-strings -ansi -D_GNU_SOURCE -DANSI_ONLY -Werror -o .libs/ipctest ipctest.o ./.libs/libplumb.so /home/ssaleh/Reusable-Cluster-Components-glue--glue-1.0.9/lib/pils/.libs/libpils.so ../../replace/.libs/libreplace.a ../../lib/pils/.libs/libpils.so -lbz2 -lxml2 -lc -lrt -ldl -lglib-2.0 -lltdl ./.libs/libplumb.so: undefined reference to `uuid_parse' ./.libs/libplumb.so: undefined reference to `uuid_generate' ./.libs/libplumb.so: undefined reference to `uuid_copy' ./.libs/libplumb.so: undefined reference to `uuid_is_null' ./.libs/libplumb.so: undefined reference to `uuid_unparse' ./.libs/libplumb.so: undefined reference to `uuid_clear' ./.libs/libplumb.so: undefined reference to `uuid_compare' collect2: ld returned 1 exit status gmake[2]: *** [ipctest] Error 1 gmake[2]: Leaving directory `/home/john/Reusable-Cluster-Components-glue--glue-1.0.9/lib/clplumbing' gmake[1]: *** [all-recursive] Error 1 gmake[1]: Leaving directory `/home/john/Reusable-Cluster-Components-glue--glue-1.0.9/lib' make: *** [all-recursive] Error 1 [root@bigb1 Reusable-Cluster-Components-glue--glue-1.0.9]# How do I resolve this? Thanks in advance. J. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
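If the missing piece really is the uuid development library, installing it and rebuilding would look roughly like this (package names differ per distribution):

    # Debian/Ubuntu
    apt-get install uuid-dev
    # RHEL/CentOS/Fedora
    yum install libuuid-devel
    # then rebuild cluster-glue from the top of the source tree
    ./configure && make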
Re: [Linux-HA] cman-controlled cluster takes an hour to start !?
Put your cluster node hostnames in /etc/hosts, and I think you are missing <cman two_node="1" expected_votes="1"/> in cluster.conf 2013/8/23 Jakob Curdes j...@info-systems.de Hmmm, the problem turns out to be DNS-related. At startup, some of the virtual interfaces are inactive and the DNS servers are unreachable. And CMAN seems to do a lookup for all IP addresses on the machine; I have the names of all cluster members in the hosts file but not all names of all other addresses (i.e. the ones managed by the cluster). Anyway I wonder why even with -d64 it doesn't tell me anything about what it is doing. I think the timespan of an hour is just because we have lots of VLAN interfaces that it wants to get a DNS name for. Regards, Jakob Curdes ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
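A sketch of the two suggestions above; the host names and addresses are placeholders:

    <!-- /etc/cluster/cluster.conf: two-node mode, no quorum wait -->
    <cman two_node="1" expected_votes="1"/>

    # /etc/hosts: resolve cluster members and cluster-managed addresses locally
    192.168.1.10   node1.example.com   node1
    192.168.1.11   node2.example.com   node2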
Re: [Linux-HA] Replacement of heartbeat on fc18 ?
yum install corosync pacemaker 2013/8/23 Francis SOUYRI francis.sou...@apec.fr Hi, Thank you but I do not find on yum OpenAIS or an rpm for OpenAIS for fc18, do you know where I can search. Best regards. Francis On 08/23/2013 03:58 PM, Nick Cameo wrote: Pacemaker+Corosync/OpenAIS http://clusterlabs.org/ N. __**_ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/**mailman/listinfo/linux-hahttp://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/**ReportingProblemshttp://linux-ha.org/ReportingProblems __**_ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/**mailman/listinfo/linux-hahttp://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/**ReportingProblemshttp://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] resource won't run on a specific node
Hello Can you show us crm configure show? thanks 2013/7/27 Miles Fidelman mfidel...@meetinghouse.net Hi Folks, Dual-node, pacemaker cluster, DRBD-backed xen virtual machines - one of our VMs will run on one node, but not the other, and crm status yields a failure message saying that starting the resource failed for unknown reasons. The log is only slightly less useless: (server2 and server3 are the nodes, server1 is the resource) server3, running server1, crashes node entries from server2 trying to failover the resource Jul 27 06:27:06 server2 pengine: [1365]: info: get_failcount: server1 has failed INFINITY times on server2 Jul 27 06:27:06 server2 pengine: [1365]: WARN: common_apply_stickiness: Forcing server1 away from server2 after 100 failures (max=100) Jul 27 06:27:06 server2 pengine: [1365]: info: native_color: Resource server1 cannot run anywhere Jul 27 06:27:06 server2 pengine: [1365]: notice: LogActions: Leave resource server1#011(Stopped) Attempts to migrate the server fail with the same errors. Failover USED to work just fine. It still works for other VMs. Any idea how to track down what's failing? Thanks very much, Miles Fidelman -- In theory, there is no difference between theory and practice. In practice, there is. Yogi Berra __**_ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/**mailman/listinfo/linux-hahttp://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/**ReportingProblemshttp://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
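Besides posting crm configure show, the INFINITY failcount visible in the log has to be cleared before server1 is allowed back onto server2; with crmsh that is roughly:

    crm resource failcount server1 show server2   # inspect the counter
    crm resource cleanup server1                  # clear failed operations and failcount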
Re: [Linux-HA] Antw: Re: Pacemaker - Resource dont get started on the standby node.
Hello Parkirat Thank you very much 2013/6/17 Parkirat parkiratba...@gmail.com Thanks Ulrich, I have figured out the problem. The actual problem was in the configuration file for the resource httpd. It was correct in the Master node but the configuration was missing in the standby node, which was not allowing it to start. Regards, Parkirat Singh Bagga. -- View this message in context: http://linux-ha.996297.n3.nabble.com/Pacemaker-Resource-dont-get-started-on-the-standby-node-tp14686p14695.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker - Resource dont get started on the standby node.
Hello Parkirat can you share with us what was the problem? maybe this can help others persons Thanks 2013/6/16 Parkirat parkiratba...@gmail.com I figured out the problem. Thanks and Regards, Parkirat Singh Bagga. -- View this message in context: http://linux-ha.996297.n3.nabble.com/Pacemaker-Resource-dont-get-started-on-the-standby-node-tp14686p14691.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start should be group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd_fs_after_drbd inf: ma-ms-drbd5:promote astorage:start 2013/6/6 Thomas Glanzmann tho...@glanzmann.de Hello, on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When putting multiple filesystems which depend on multiple drbd promotions, only the first drbd is promoted and the group never comes up. However when the promotions are not based on the individual filesystems but on the group or probably any single entity all drbds are promoted correctly. So to summarize: This only promotes the first drbd and the resource group never starts: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start # ~~ This works: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start # ~~ I would like to know if that is supposed to happen. If that is the case I would understand why this is the case. I assume it is a bug, but I'm not sure. Complete working config here: primitive astorage_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \ op monitor interval=60s primitive astorage1-fencing stonith:external/ipmi \ params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage2-fencing stonith:external/ipmi \ params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage_16_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \ op monitor interval=60s primitive drbd10 ocf:linbit:drbd \ params drbd_resource=r10 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd10_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \ op monitor interval=60s primitive drbd3 ocf:linbit:drbd \ params drbd_resource=r3 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd4 ocf:linbit:drbd \ params drbd_resource=r4 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5 ocf:linbit:drbd \ params drbd_resource=r5 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages fstype=ext3 \ op monitor interval=60s primitive drbd6 ocf:linbit:drbd \ params drbd_resource=r6 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8 ocf:linbit:drbd \ params drbd_resource=r8 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \ op monitor interval=60s primitive drbd9 ocf:linbit:drbd \ params drbd_resource=r9 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd9_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd9 directory=/exports fstype=ext4 \ op monitor interval=60s primitive nfs-common ocf:heartbeat:nfs-common \ op monitor 
interval=60s primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \ op monitor interval=60s primitive target ocf:heartbeat:target \ op monitor interval=60s group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common nfs-kernel-server astorage_ip astorage_16_ip target \ meta target-role=Started ms ma-ms-drbd10 drbd10 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd3 drbd3 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd4 drbd4 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd5 drbd5 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd6 drbd6 \ meta master-max=1
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
sorry it should be group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd_fs_after_drbd inf: ma-ms-drbd5:promote ma-ms-drbd8:promote astorage:start 2013/6/6 emmanuel segura emi2f...@gmail.com group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start should be group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd_fs_after_drbd inf: ma-ms-drbd5:promote astorage:start 2013/6/6 Thomas Glanzmann tho...@glanzmann.de Hello, on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When putting multiple filesystems which depend on multiple drbd promotions, only the first drbd is promoted and the group never comes up. However when the promotions are not based on the individual filesystems but on the group or probably any single entity all drbds are promoted correctly. So to summarize: This only promotes the first drbd and the resource group never starts: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start # ~~ This works: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start # ~~ I would like to know if that is supposed to happen. If that is the case I would understand why this is the case. I assume it is a bug, but I'm not sure. Complete working config here: primitive astorage_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \ op monitor interval=60s primitive astorage1-fencing stonith:external/ipmi \ params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage2-fencing stonith:external/ipmi \ params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage_16_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \ op monitor interval=60s primitive drbd10 ocf:linbit:drbd \ params drbd_resource=r10 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd10_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \ op monitor interval=60s primitive drbd3 ocf:linbit:drbd \ params drbd_resource=r3 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd4 ocf:linbit:drbd \ params drbd_resource=r4 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5 ocf:linbit:drbd \ params drbd_resource=r5 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages fstype=ext3 \ op monitor interval=60s primitive drbd6 ocf:linbit:drbd \ params drbd_resource=r6 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8 ocf:linbit:drbd \ params drbd_resource=r8 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \ op monitor interval=60s primitive drbd9 ocf:linbit:drbd \ params drbd_resource=r9 \ op monitor interval=29s role=Master \ 
op monitor interval=31s role=Slave primitive drbd9_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd9 directory=/exports fstype=ext4 \ op monitor interval=60s primitive nfs-common ocf:heartbeat:nfs-common \ op monitor interval=60s primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \ op monitor interval=60s primitive target ocf:heartbeat:target \ op monitor interval=60s group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common nfs-kernel-server astorage_ip astorage_16_ip target \ meta target-role=Started ms ma-ms-drbd10 drbd10 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd3 drbd3 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd4 drbd4 \ meta master-max=1 master-node-max=1 clone-max=2 clone
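Written out in crmsh syntax, the corrected idea is a single resource-set order that promotes both masters before the group starts; colocation constraints keeping the group with the masters would normally accompany it (a sketch based on the names in the posted config):

    order o_drbd_before_group inf: ma-ms-drbd5:promote ma-ms-drbd8:promote astorage:start
    colocation c_group_with_drbd5 inf: astorage ma-ms-drbd5:Master
    colocation c_group_with_drbd8 inf: astorage ma-ms-drbd8:Master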
Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds
Hello Thomas Sorry i can't give you any explain, because i don't see any sense in your config Sorry 2013/6/6 Thomas Glanzmann tho...@glanzmann.de Hello, on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When putting multiple filesystems which depend on multiple drbd promotions, only the first drbd is promoted and the group never comes up. However when the promotions are not based on the individual filesystems but on the group or probably any single entity all drbds are promoted correctly. So to summarize: This only promotes the first drbd and the resource group never starts: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start # ~~ This works: group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start # ~~ I would like to know if that is supposed to happen. If that is the case I would understand why this is the case. I assume it is a bug, but I'm not sure. Complete working config here: primitive astorage_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \ op monitor interval=60s primitive astorage1-fencing stonith:external/ipmi \ params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage2-fencing stonith:external/ipmi \ params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN passwd=secret \ op monitor interval=60s primitive astorage_16_ip ocf:heartbeat:IPaddr2 \ params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \ op monitor interval=60s primitive drbd10 ocf:linbit:drbd \ params drbd_resource=r10 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd10_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \ op monitor interval=60s primitive drbd3 ocf:linbit:drbd \ params drbd_resource=r3 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd4 ocf:linbit:drbd \ params drbd_resource=r4 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5 ocf:linbit:drbd \ params drbd_resource=r5 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd5_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages fstype=ext3 \ op monitor interval=60s primitive drbd6 ocf:linbit:drbd \ params drbd_resource=r6 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8 ocf:linbit:drbd \ params drbd_resource=r8 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd8_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \ op monitor interval=60s primitive drbd9 ocf:linbit:drbd \ params drbd_resource=r9 \ op monitor interval=29s role=Master \ op monitor interval=31s role=Slave primitive drbd9_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd9 directory=/exports fstype=ext4 \ op monitor interval=60s primitive nfs-common ocf:heartbeat:nfs-common \ op monitor interval=60s primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \ op monitor interval=60s primitive target ocf:heartbeat:target \ op monitor interval=60s group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common nfs-kernel-server astorage_ip 
astorage_16_ip target \ meta target-role=Started ms ma-ms-drbd10 drbd10 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd3 drbd3 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd4 drbd4 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd5 drbd5 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd6 drbd6 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd8 drbd8 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started ms ma-ms-drbd9 drbd9 \ meta master-max=1
Re: [Linux-HA] Network failover and communication channel survival
maybe you can use openvswitch 2013/4/30 Lang, David david_l...@intuit.com I've thought about this for a few years, but have not yet implemented it. What I would look at is setting up a new virtual network that trunks your two physical networks together and you can then use the IP on that trunk for your communication. David Lang Richard Comblen richard.comb...@kreios.lu wrote: Hi all, I have a two node setup with replicated PostgreSQL DB (master/slave setup). Focus is on keeping the system up and running, not on capacity. All good, that works fine. Now, a new requirement shows up: the two nodes should be connected using two physically separated networks, and should survive failure of one of the two networks. The two nodes communicate together for PostgreSQL replication. Initially, the slave will communicate with the master on network 1, and if network 1 fails, it should switch to network 2. Obviously, I cannot use a virtual-ip over two different networks. What would be the closest solution in term of features ? I was thinking about having a resource agent managing a port forward rule on slave node, something like localhost:5433 = master_ip_on_network1:5432, that would switch to localhost:5433 = master_ip_on_network2:5432, so that it would be transparent for the replication tool. Do you know if such a resource agent is already implemented somewhere ? Do you have remarks, comments about such a setup ? Do you have suggestion on a better way to achieve these requirements ? Thanks, Richard ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
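As a very rough sketch of the virtual network / trunking idea with Open vSwitch (interface names and the address are invented, and whether a bond across two physically separate networks fits this topology would need testing):

    ovs-vsctl add-br br0
    ovs-vsctl add-bond br0 bond0 eth1 eth2
    ovs-vsctl set port bond0 bond_mode=active-backup
    ip addr add 10.0.0.1/24 dev br0
    ip link set br0 up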
Re: [Linux-HA] Is CLVM really needed in an active/passive cluster?
Hello Angel In this thread http://comments.gmane.org/gmane.linux.redhat.release.rhel5/6395 you can find the answer to your question Thanks 2013/4/22 Angel L. Mateo ama...@um.es Hello, I'm deploying a clustered pop/imap server with mailboxes stored in a SAN connected with fibre channel. The problem I have is that I have firstly configured the cluster with CLVM, but with this I can't create snapshots of my volumes, which is required for backups. But is this CLVM really necessary? Or it is enough to configure LVM with fencing and stonith? -- Angel L. Mateo Martínez Sección de Telemática Área de Tecnologías de la Información y las Comunicaciones Aplicadas (ATICA) http://www.um.es/atica Tfo: 868889150 Fax: 86337 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
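For what it is worth, active/passive setups are often run without cLVM: Pacemaker activates a plain (non-clustered) VG on one node at a time, with stonith configured, which keeps LVM snapshots usable. A minimal sketch with a placeholder VG name:

    primitive p_lvm_mail ocf:heartbeat:LVM \
        params volgrp="vg_mail" \
        op monitor interval="60s"
    # also keep the VG out of boot-time auto-activation, e.g. via the
    # volume_list filter in /etc/lvm/lvm.conf on both nodes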
Re: [Linux-HA] Fwd: stonith with sbd not working
create a partition on /dev/sdd and you that 2013/4/9 Fredrik Hudner fredrik.hud...@gmail.com Hi, I have a (for now) two node HA cluster with sbd as stonith mechanism. I have followed the installation and configuration of sbd from http://www.linux-ha.org/wiki/SBD_Fencing. For one reason or another stonith won't start and messages log says at one point: stonith-ng[20383]: notice: stonith_device_action: Device stonith_sbd not found. stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-12 stonith-ng[30234]: info: stonith_device_register: Added 'stonith_sbd' to the device list (1 active devices) stonith-ng[30234]: info: stonith_command: Processed st_device_register from lrmd: rc=0 stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-1 stonith-ng[30234]: notice: log_operation: Operation 'monitor' [30502] for device 'stonith_sbd' returned: -2 stonith-ng[30234]: info: stonith_device_remove: Removed 'stonith_sbd' from the device list (0 active devices) Maybe it doesn't find the sbd device /dev/sdd ? From the output of my logs, you can see that both machines see each others sbd device but stonith doesn't seem to recognize the device. Question is why Attached are all possible logs and config files from drbd, sbd, pacemaker. Corosync I can send if you need me too, but the post became to big with that included Kind regards /Fredrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Fwd: stonith with sbd not working
Sorry create a partition on /dev/sdd and you use that 2013/4/9 emmanuel segura emi2f...@gmail.com create a partition on /dev/sdd and you that 2013/4/9 Fredrik Hudner fredrik.hud...@gmail.com Hi, I have a (for now) two node HA cluster with sbd as stonith mechanism. I have followed the installation and configuration of sbd from http://www.linux-ha.org/wiki/SBD_Fencing. For one reason or another stonith won't start and messages log says at one point: stonith-ng[20383]: notice: stonith_device_action: Device stonith_sbd not found. stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-12 stonith-ng[30234]: info: stonith_device_register: Added 'stonith_sbd' to the device list (1 active devices) stonith-ng[30234]: info: stonith_command: Processed st_device_register from lrmd: rc=0 stonith-ng[30234]: info: stonith_command: Processed st_execute from lrmd: rc=-1 stonith-ng[30234]: notice: log_operation: Operation 'monitor' [30502] for device 'stonith_sbd' returned: -2 stonith-ng[30234]: info: stonith_device_remove: Removed 'stonith_sbd' from the device list (0 active devices) Maybe it doesn't find the sbd device /dev/sdd ? From the output of my logs, you can see that both machines see each others sbd device but stonith doesn't seem to recognize the device. Question is why Attached are all possible logs and config files from drbd, sbd, pacemaker. Corosync I can send if you need me too, but the post became to big with that included Kind regards /Fredrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
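Spelled out, that suggestion might look like the following (partition size and the sysconfig path are assumptions; SBD only needs a few megabytes):

    parted -s /dev/sdd mklabel msdos mkpart primary 1MiB 9MiB   # wipes /dev/sdd
    sbd -d /dev/sdd1 create        # write the SBD header
    sbd -d /dev/sdd1 dump          # verify the stored timeouts
    echo 'SBD_DEVICE="/dev/sdd1"' >> /etc/sysconfig/sbd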
Re: [Linux-HA] Q: corosync/TOTEM retransmit list
Try look here http://www.hastexo.com/resources/hints-and-kinks/whats-totem-retransmit-list-all-about-corosync 2013/4/3 Ulrich Windl ulrich.wi...@rz.uni-regensburg.de Hi! I have a simple question: Is it possible that DLM or OCFS2 causes corosync/TOTEM retransmit messages? I have the feeling that whenever OCFS2 is busy, corosync/TOTEM sends out retransmit lists like this: [...] Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6940d Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69410 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69413 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69415 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69417 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69419 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6941b Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6941d Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69420 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69422 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69424 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69426 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69428 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6942a Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6942c Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6942e Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69430 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69432 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69436 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69439 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6943c Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6943e Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69440 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69442 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69444 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69446 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69448 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6944a Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6944c Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 6944e Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69450 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69452 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69454 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69456 Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Marking ringid 1 interface 10.2.2.1 FAULTY Apr 3 11:30:01 h01 corosync[4310]: [TOTEM ] Retransmit List: 69458 Apr 3 11:30:02 h01 corosync[4310]: [TOTEM ] Automatically recovered ring 1 Apr 3 11:30:02 h01 corosync[4310]: [TOTEM ] Automatically recovered ring 1 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69484 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69486 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69486 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69486 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 69487 Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948a Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit 
List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b 6948d Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:07 h01 corosync[4310]: [TOTEM ] Retransmit List: 6948b Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69492 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69494 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69496 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 69499 Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 6949b Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 6949d Apr 3 11:30:09 h01 corosync[4310]: [TOTEM ] Retransmit List: 6949f Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a3 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a5 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a7 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694a9 Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694ab Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694ad Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694af Apr 3 11:30:14 h01 corosync[4310]: [TOTEM ] Retransmit List: 694b1 Apr 3
Re: [Linux-HA] Heartbeat IPv6addr OCF
Hello Nick Try to use nic=eth0 instead of nic=eth0:3 thanks 2013/3/24 Nick Walke tubaguy50...@gmail.com Thanks for the tip, however, it did not work. That's actually a /116. So I put in 2600:3c00::0034:c007/116 and am getting the same error. I requested that it restart the resource as well, just to make sure it wasn't the previous error. Nick On Sun, Mar 24, 2013 at 3:55 AM, Thomas Glanzmann tho...@glanzmann.de wrote: Hello, ipv6addr=2600:3c00::0034:c007 from the manpage of ocf_heartbeat_IPv6addr it looks like that you have to specify the netmask so try: ipv6addr=2600:3c00::0034:c007/64 assuiming that you're in a /64. Cheers, Thomas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
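Putting the two hints together (the /116 prefix and a plain nic=eth0), the primitive would look roughly like:

    primitive p_ip6 ocf:heartbeat:IPv6addr \
        params ipv6addr="2600:3c00::0034:c007" cidr_netmask="116" nic="eth0" \
        op monitor interval="10s"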
Re: [Linux-HA] Problem promoting Slave to Master
Hello Fedrik Why you have a clone of cl_exportfs_root and you have ext4 filesystem, and i think this order is not correct order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start order o_root_before_nfs inf: cl_exportfs_root g_nfs:start I think like that you try to start g_nfs twice 2013/3/14 Fredrik Hudner fredrik.hud...@evry.com Hi all, I have a problem after I removed a node with the force command from my crm config. Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6, pacemaker 1.1.7-6.el6) Then I wanted to add a third node acting as quorum node, but was not able to get it to work (probably because I don't understand how to set it up). So I removed the 3rd node, but had to use the force command as crm complained when I tried to remove it. Now when I start up Pacemaker the resources doesn't look like they come up correctly Online: [ testclu01 testclu02 ] Master/Slave Set: ms_drbd_nfs [p_drbd_nfs] Masters: [ testclu01 ] Slaves: [ testclu02 ] Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver] Started: [ tdtestclu01 tdtestclu02 ] Resource Group: g_nfs p_lvm_nfs (ocf::heartbeat:LVM): Started testclu01 p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01 p_fs_shared2 (ocf::heartbeat:Filesystem):Started testclu01 p_ip_nfs (ocf::heartbeat:IPaddr2): Started testclu01 Clone Set: cl_exportfs_root [p_exportfs_root] Started: [ testclu01 testclu02 ] Failed actions: p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7, status=complete): not running p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running The filesystems mount correctly on the master at this stage and can be written to. When I stop the services on the master node for it to failover, it doesn't work.. Looses cluster-ip connectivity Corosync.log from master after I stopped pacemaker on master node : see attached file Additional files (attached): crm-configure show Corosync.conf Global_common.conf I'm not sure how to proceed to get it up in a fair state now So if anyone could help me it would be much appreciated Kind regards /Fredrik Hudner ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
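If the double-start suspicion is right, one way to express the dependency only once is a single ordered chain (a sketch reusing the resource names from the posted config):

    order o_nfs_chain inf: ms_drbd_nfs:promote cl_exportfs_root:start g_nfs:start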
Re: [Linux-HA] Using a Ping Daemon (or Something Better) to Prevent Split Brain
I don't know if ping is rigth for your case, try to look here http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.geo.html 2013/1/31 Robinson, Eric eric.robin...@psmnv.com We have this configuration: NodeA is located in DataCenterA. NodeB is located in (geographically separate) DataCenterB. DataCenterA is connected to DataCenterB through 4 redundant gigabit links (two physically separate Corosync rings). Both nodes reach the Internet through (geographically separate) DataCenterC. Is it possible to prevent cluster partition (or drbd split brain) if the links between DataCenterA and DataCenterB go down, but at least one node can still communicate with DataCenterC? Note that we have no equipment at DataCenterC, but we can ping stuff in it and through it. Ideally, I would like to prevent a secondary node from going primary if it cannot communicate with DataCenterC. -- Eric Robinson Disclaimer - January 30, 2013 This email and any files transmitted with it are confidential and intended solely for linux-ha@lists.linux-ha.org. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physicians' Managed Care or Physician Select Management. Warning: Although Physicians' Managed Care or Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. This disclaimer was added by Policy Patrol: http://www.policypatrol.com/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
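A common building block for "only stay primary while DataCenterC is reachable" is the ping resource plus a location rule; a sketch (the probe addresses and the ms_drbd resource name are placeholders):

    primitive p_ping ocf:pacemaker:ping \
        params host_list="192.0.2.1 192.0.2.2" multiplier="1000" \
        op monitor interval="15s" timeout="60s"
    clone cl_ping p_ping
    location l_need_dcc ms_drbd \
        rule -inf: not_defined pingd or pingd lte 0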
Re: [Linux-HA] resources running on both node
1: check if the services are configured to start at boot time 2:without info nobody can help you 2012/7/21 Chirag Vaishnav chirag.vaish...@saicare.com Hi, We are HA between two nodes, everything is configured as per standard example file (using haresources) and everything works well generally, but some time we face issue that HA is up and running on both the nodes with resources (IP) allocated on both node. Ideally in such case one of the node should leave the resources but it doesn't happen. Any suggestions would be helpful. Thanks Regards, Chirag Vaishnav Sai Infosystem (India) Ltd. Corp. Office: Sai Care Super Plaza, Sandesh press road, Vastrapur Ahmedabad . 380054. Gujarat, India Phone: +91-79-30110400 Url: www.saicare.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
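For point 1, on a haresources-style setup the check is simply that nothing listed in haresources is also started by init; for example (httpd stands in for whatever service is in haresources):

    chkconfig --list heartbeat    # should be on
    chkconfig --list httpd        # services managed by haresources should be off
    chkconfig httpd off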
Re: [Linux-HA] DRBD and automatic sync
are you using ext3 for drbd active/active? UM 2012/8/3 Elvis Altherr elvis.alth...@gmail.com Hello together On my gentoo servers (2 Node Cluster with kernel 3.x) i use heartbeat 3.0.5 and DRBD 8.4.0 for block replication between the two machines which served apache, mysql and samba fileservices Everything works fine, except the automatic sync between the two drives wich are both primarys What did i wrong? conf files see below drbd.conf resource r0 { # protocol to use; C is the the safest variant net { allow-two-primaries; } protocol C; startup { become-primary-on both; #timeout (in seconds) for the connection on startup wfc-timeout 90; # timeout (in seconds) for the connection on startup #after detection of data inconsistencies (degraded mode) degr-wfc-timeout 120; } syncer { # maximum bandwidth to use for this resource rate 100M; } on mail2 { ### options for master-server ### # name of the allocated blockdevice device /dev/drbd0; # underlying blockdevice disk /dev/sdb1; #address and port to use for the synchronisation # here we use the heartbeat network address10.0.0.1:7788; # where to store DRBD metadata; here it's on the underlying device itself meta-disk internal; } on disthost3 { device /dev/drbd1; disk /dev/sda6; address 10.0.0.2:7788; meta-disk internal; } haresoures file for heartbeat mail2 10.0.0.3 drbddisk::r0 Filesystem::/dev/drbd0::/drfs::ext3 apache2 mysql bind samba ha.cf # Logging debug 1 use_logd true logfacility daemon # Misc Options traditional_compression off compression bz2 coredumps true auto_failback on # Communications udpport 694 #ucast eth1 10.0.0.1 bcast eth1 #autojoin any # Thresholds (in seconds) keepalive 2 warntime5 deadtime15 initdead60 crm no nodemail2 nodedisthost3 ~ thanks for your help -- Freundliche Grüsse Elvis Altherr Brauerstrasse 83a 9016 St. Gallen 071 280 13 79 (Privat) elvis.alth...@gmail.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] DRBD and automatic sync
i know the drbd primary to primary it's for use ocfs/gfs, so for have the filesystem read write on both nodes, why you still using heartbeat 1.X 2012/8/3 Elvis Altherr elvis.alth...@gmail.com Am 03.08.2012 09:32, schrieb emmanuel segura: are you using ext3 for drbd active/active? UM 2012/8/3 Elvis Altherr elvis.alth...@gmail.com Hello together On my gentoo servers (2 Node Cluster with kernel 3.x) i use heartbeat 3.0.5 and DRBD 8.4.0 for block replication between the two machines which served apache, mysql and samba fileservices Everything works fine, except the automatic sync between the two drives wich are both primarys What did i wrong? conf files see below drbd.conf resource r0 { # protocol to use; C is the the safest variant net { allow-two-primaries; } protocol C; startup { become-primary-on both; #timeout (in seconds) for the connection on startup wfc-timeout 90; # timeout (in seconds) for the connection on startup #after detection of data inconsistencies (degraded mode) degr-wfc-timeout 120; } syncer { # maximum bandwidth to use for this resource rate 100M; } on mail2 { ### options for master-server ### # name of the allocated blockdevice device /dev/drbd0; # underlying blockdevice disk /dev/sdb1; #address and port to use for the synchronisation # here we use the heartbeat network address10.0.0.1:7788; # where to store DRBD metadata; here it's on the underlying device itself meta-disk internal; } on disthost3 { device /dev/drbd1; disk /dev/sda6; address 10.0.0.2:7788; meta-disk internal; } haresoures file for heartbeat mail2 10.0.0.3 drbddisk::r0 Filesystem::/dev/drbd0::/drfs::ext3 apache2 mysql bind samba ha.cf # Logging debug 1 use_logd true logfacility daemon # Misc Options traditional_compression off compression bz2 coredumps true auto_failback on # Communications udpport 694 #ucast eth1 10.0.0.1 bcast eth1 #autojoin any # Thresholds (in seconds) keepalive 2 warntime5 deadtime15 initdead60 crm no nodemail2 nodedisthost3 ~ thanks for your help -- Freundliche Grüsse Elvis Altherr Brauerstrasse 83a 9016 St. Gallen 071 280 13 79 (Privat) elvis.alth...@gmail.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems yes.. woud i better use GFS2 or OFCS (which both dosen't work under kernel 3.x) ? Or which is the best file system porpouse for my case? -- Freundliche Grüsse Elvis Altherr Brauerstrasse 83a 9016 St. Gallen 071 280 13 79 (Privat) elvis.alth...@gmail.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
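If the goal really is one filesystem mounted read-write on both primaries, it has to be a cluster filesystem; with OCFS2 the formatting step would look like this (the O2CB cluster membership layer must already be configured, which is the involved part on a heartbeat v1 setup):

    mkfs.ocfs2 -N 2 -L drfs /dev/drbd0
    mount -t ocfs2 /dev/drbd0 /drfs     # run the mount on both nodes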
Re: [Linux-HA] mount.ocfs2 in D state
Do you have a stonith configured? 2012/7/2 EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com Hi, when a split brain (drbd) happens mount.ocfs2 remains hanging unkillable in D-state. rt-lxcl9a:~ # ps aux | grep ocf root 347 0.0 0.0 10468 740 ?D10:25 0:00 /sbin/mount.ocfs2 /dev/drbd_r0 /SHARED -o rw root 349 0.0 0.0 0 0 ?S10:25 0:00 [ocfs2dc] root 5906 0.0 0.0 11552 1796 ?S10:36 0:00 /bin/sh /usr/lib/ocf/resource.d//heartbeat/Filesystem stop root 32715 0.0 0.0 0 0 ?S 10:25 0:00 [ocfs2_wq] root 32717 0.0 0.0 90776 2120 ?Ss 10:25 0:00 /usr/sbin/ocfs2_controld.pcmk As a consequence I do not know how to resolve the split brain situation as I cannot demote anything anymore. Is this a known bug? Best regards Martin Konold Robert Bosch GmbH Automotive Electronics Postfach 13 42 72703 Reutlingen GERMANY www.bosch.com Tel. +49 7121 35 3322 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Franz Fehrenbach; Geschäftsführung: Volkmar Denner, Siegfried Dais; Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Dirk Hoheisel, Christoph Kübel, Uwe Raschke, Wolf-Henning Scheider, Werner Struth, Peter Tyroller ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] mount.ocfs2 in D state
remove the standby on node node rt-lxcl9a 2012/7/2 EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) external.martin.kon...@de.bosch.com Hi, Do you have a stonith configured? Yes. Though a hanging mount does not cause stonith to become activated. node rt-lxcl9a \ attributes standby=on node rt-lxcl9b \ attributes standby=off primitive dlm ocf:pacemaker:controld \ op monitor interval=60 timeout=60 \ meta target-role=Stopped primitive ip-rt-lxlr9a ocf:heartbeat:IPaddr \ params ip=10.13.132.94 cidr_netmask=255.255.252.0 \ op monitor interval=5s timeout=20s depth=0 \ meta target-role=Started primitive ip-rt-lxlr9b ocf:heartbeat:IPaddr \ params ip=10.13.132.95 cidr_netmask=255.255.252.0 \ op monitor interval=5s timeout=20s depth=0 \ meta target-role=Started primitive o2cb ocf:ocfs2:o2cb \ op monitor interval=60 timeout=60 \ meta target-role=Stopped primitive resDRBD ocf:linbit:drbd \ params drbd_resource=r0 \ operations $id=resDRBD-operations \ op monitor interval=20 role=Master timeout=20 \ op monitor interval=30 role=Slave timeout=20 \ meta target-role=Stopped primitive resource-fs ocf:heartbeat:Filesystem \ params device=/dev/drbd_r0 directory=/SHARED fstype=ocfs2 \ op monitor interval=120s \ meta target-role=Stopped primitive stonith-ilo-rt-lxcl9ar stonith:external/ipmi \ params hostname=rt-lxcl9a ipaddr=10.13.172.85 userid=stonith passwd=stonithstonith passwd_method=param interface=lanplus pcmk_host_check=static-list pcmk_host_list=rt-lxcl9a \ meta target-role=Started primitive stonith-ilo-rt-lxcl9br stonith:external/ipmi \ params hostname=rt-lxcl9b ipaddr=10.13.172.93 userid=stonith passwd=stonithstonith passwd_method=param interface=lanplus pcmk_host_check=static-list pcmk_host_list=rt-lxcl9b \ meta target-role=Started ms msDRBD resDRBD \ meta resource-stickines=100 notify=true master-max=2 interleave=true target-role=Started clone clone-dlm dlm \ meta globally-unique=false interleave=true target-role=Started clone clone-fs resource-fs \ meta interleave=true ordered=true target-role=Started clone clone-ocb o2cb \ meta globally-unique=false interleave=true target-role=Stopped location location-stonith-ilo-rt-lxcl9ar stonith-ilo-rt-lxcl9ar -inf: rt-lxcl9a location location-stonith-ilo-rt-lxcl9br stonith-ilo-rt-lxcl9br -inf: rt-lxcl9b colocation colocation-dlm-drbd inf: clone-dlm msDRBD:Master colocation colocation-fs-o2cb inf: clone-fs clone-ocb colocation colocation-ocation-dlm inf: clone-ocb clone-dlm order order-dlm-o2cb 0: clone-dlm clone-ocb order order-drbd-dlm 0: msDRBD:promote clone-dlm:start order order-o2cb-fs 0: clone-ocb clone-fs property $id=cib-bootstrap-options \ stonith-enabled=true \ no-quorum-policy=ignore \ placement-strategy=balanced \ dc-version=1.1.6-b988976485d15cb702c9307df55512d323831a5e \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ last-lrm-refresh=1341218156 \ stonith-timeout=30s \ maintenance-mode=true rsc_defaults $id=rsc-options \ resource-stickiness=200 \ migration-threshold=3 op_defaults $id=op-options \ timeout=600 \ record-pending=true Yours, -- martin ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
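Clearing the standby attribute can be done from either node, for example:

    crm node online rt-lxcl9a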
Re: [Linux-HA] ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop
First of all the parameter 201 it must be diferent for every resource 2012/6/19 Martin Marji Cermak cerm...@gmail.com Hello guys, I have 3 questions if you please. I have a HA NFS cluster - Centos 6.2, pacemaker, corosync, two NFS nodes plus 1 quorum node, in semi Active-Active configuration. By semi, I mean that both NFS nodes are active and each of them is under normal circumstances exclusively responsible for one (out of two) Volume Group - using the ocf:heartbeat:LVM RA. Each LVM volume group lives on a dedicated multipath iscsi device, exported from a shared SAN. I'm exporting a NFSv3/v4 export (/srv/nfs/software_repos directory). I need to make it available for 2 separate /21 networks as read-only, and for 3 different servers as read-write. I'm using the ocf:heartbeat:exportfs RA and it seems to me I have to use the ocf:heartbeat:exportfs RA 5 times. The configuration (only IP addresses changed) is here: http://pastebin.com/eHkgUv64 1) is there a way how to export this directory 5 times without defining 5 ocf:heartbeat:exportfs primitives? It's a lot of duplications... I search all the forums and I fear the ocf:heartbeat:exportfs simply supports only one host / network range. But maybe someone has been working on a patch? 2) while using the ocf:heartbeat:exportfs 5 times for the same directory, do I have to use the _same_ FSID (201 in my config) for all these 5 primitives (as Im exporting the _same_ filesystem / directory)? I'm getting this warning when doing so WARNING: Resources p_exportfs_software_repos_ae1,p_exportfs_software_repos_ae2,p_exportfs_software_repos_buller,p_exportfs_software_repos_iap-mgmt,p_exportfs_software_repos_youyangs violate uniqueness for parameter fsid: 201 Do you still want to commit? 3) wait_for_leasetime_on_stop - I believe this must be set to true when exporting NFSv4 with ocf:heartbeat:exportfs. http://www.linux-ha.org/doc/man-pages/re-ra-exportfs.html My 5 exportfs primitives reside in the same group: group g_nas02 p_lvm02 p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt p_ip02 \ meta resource-stickiness=101 Even though I have the /proc/fs/nfsd/nfsv4gracetime set to 10 seconds, a failover of the NFS group from one NFS node to the second node would take more than 50 seconds, as it will be waiting for each ocf:heartbeat:exportfs resource sleeping 10 seconds 5 times. Is there any way of making them fail over / sleeping in parallel, instead of sequential? I workarounded this by setting wait_for_leasetime_on_stop=true for only one of these (which I believe is safe and does the job it is expected to do - please correct me if I'm wrong). Thank you for your valuable comments. 
My Pacemaker configuration: http://pastebin.com/eHkgUv64

[root@irvine ~]# facter | egrep 'lsbdistid|lsbdistrelease'
lsbdistid = CentOS
lsbdistrelease = 6.2
[root@irvine ~]# rpm -qa | egrep 'pacemaker|corosync|agents'
corosync-1.4.1-4.el6_2.2.x86_64
pacemaker-cli-1.1.6-3.el6.x86_64
pacemaker-libs-1.1.6-3.el6.x86_64
corosynclib-1.4.1-4.el6_2.2.x86_64
pacemaker-cluster-libs-1.1.6-3.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
fence-agents-3.1.5-10.el6_2.2.x86_64
resource-agents-3.9.2-7.el6.x86_64

with /usr/lib/ocf/resource.d/heartbeat/exportfs updated by hand from:
https://github.com/ClusterLabs/resource-agents/commits/master/heartbeat/exportfs

Thank you very much
Marji Cermak

--
esta es mi vida e me la vivo hasta que dios quiera
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
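For illustration, this is roughly what the fsid advice at the top of this thread amounts to (only two of the five primitives shown; the clientspec networks are placeholders, since the real addresses were edited out of the pastebin, and only the fsid values differ):

primitive p_exportfs_software_repos_ae1 ocf:heartbeat:exportfs \
        params directory=/srv/nfs/software_repos clientspec=10.10.0.0/21 options=ro fsid=201 \
        op monitor interval=30s
primitive p_exportfs_software_repos_ae2 ocf:heartbeat:exportfs \
        params directory=/srv/nfs/software_repos clientspec=10.10.8.0/21 options=ro fsid=202 \
        op monitor interval=30s

Whether wait_for_leasetime_on_stop needs to be set on every primitive or only on one, as in the workaround described above, remains the open question of this thread.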
Re: [Linux-HA] Active/Active Cluster
Why are you using cman and corosync together? I think you should use either cman+pacemaker or corosync+pacemaker.

2012/6/9 Yount, William D yount.will...@menloworldwide.com:

I have two servers which are both Dell 990s. Each server has two 1 TB hard drives configured in RAID0. I have installed CentOS on both and they have the same partition sizes. I am using /dev/KNTCLFS00X/Storage as a drbd partition and attaching it to /dev/drbd0. DRBD syncing appears to be working fine.

I am trying to set up an Active/Active cluster. I have set up CMAN. I want to use this cluster just for NFS storage. I want to have these services running on both nodes at the same time:
* IP Address
* DRBD
* Filesystem (gfs2)

Through a combination of official documentation and LCMC, I have this set up. However, I am getting this:

Last updated: Fri Jun 8 23:11:58 2012
Last change: Fri Jun 8 23:11:37 2012 via crmd on KNTCLFS002
Stack: cman
Current DC: KNTCLFS002 - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
6 Resources configured.

Online: [ KNTCLFS001 KNTCLFS002 ]

Clone Set: cl_IPaddr2_1 [res_IPaddr2_1]
    Started: [ KNTCLFS002 KNTCLFS001 ]
Master/Slave Set: ms_drbd_2 [res_drbd_2]
    Masters: [ KNTCLFS002 ]
    Stopped: [ res_drbd_2:0 ]
Clone Set: cl_Filesystem_1 [res_Filesystem_1] (unique)
    res_Filesystem_1:0 (ocf::heartbeat:Filesystem): Stopped
    res_Filesystem_1:1 (ocf::heartbeat:Filesystem): Started KNTCLFS002

Failed actions:
    res_Filesystem_1:0_start_0 (node=KNTCLFS002, call=42, rc=1, status=complete): unknown error
    res_Filesystem_1:0_start_0 (node=KNTCLFS001, call=87, rc=1, status=complete): unknown error

I have attached my nfs.res, cluster.conf and corosync.conf files. Please let me know if I can provide any other information to help resolve this.

Thanks,
William Yount | Systems Analyst | Menlo Worldwide | Cell: 901-654-9933
Safety | Leadership | Integrity | Commitment | Excellence
Please consider the environment before printing this e-mail

--
esta es mi vida e me la vivo hasta que dios quiera
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
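For comparison, a minimal sketch of the cman+pacemaker variant suggested above, assuming the stock CentOS 6 init scripts (cman brings up corosync itself from cluster.conf, so no standalone corosync service should be running alongside it):

        chkconfig corosync off
        chkconfig cman on
        chkconfig pacemaker on
        service cman start
        service pacemaker start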
Re: [Linux-HA] problem with nfs and exportfs failover
Maybe the problem it's the primitive nfsserver lsb:nfs-kernel-server, i think this primitive was stoped befoure exportfs-admin ocf:heartbeat:exportfs And if i rember the lsb:nfs-kernel-server and exportfs agent does the same thing the first use the os scripts and the second the cluster agents Il giorno 14 aprile 2012 01:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 4/13/12 7:18 PM, William Seligman wrote: On 4/13/12 6:42 PM, Seth Galitzer wrote: In attempting to build a nice clean config, I'm now in a state where exportfs never starts. It always times out and errors. crm config show is pasted here: http://pastebin.com/cKFFL0Xf syslog after an attempted restart here: http://pastebin.com/CHdF21M4 Only IPs have been edited. It's clear that your exportfs resource is timing out for the admin resource. I'm no expert, but here are some stupid exportfs tricks to try: - Check your /etc/exports file (or whatever the equivalent is in Debian; man exportfs will tell you) on both nodes. Make sure you're not already exporting the directory when the NFS server starts. - Take out the exportfs-admin resource. Then try doing things manually: # exportfs x.x.x.0/24:/exports/admin Assuming that works, then look at the output of just # exportfs The clientspec reported by exportfs has to match the clientspec you put into the resource exactly. If exportfs is canonicalizing or reporting the clientspec differently, the exportfs monitor won't work. If this is the case, change the clientspec parameter in exportfs-admin to match. If the output of exportfs has any results that span more than one line, then you've got the problem that the patch I referred you to (quoted below) is supposed to fix. You'll have to apply the patch to your exportfs resource. Wait a second; I completely forgot about this thread that I started: http://www.gossamer-threads.com/lists/linuxha/users/78585 The solution turned out to be to remove the .rmtab files from the directories I was exporting, deleting touching /var/lib/nfs/rmtab (you'll have to look up the Debian location), and adding rmtab_backup=none to all my exportfs resources. Hopefully there's a solution for you in there somewhere! On 04/13/2012 01:51 PM, William Seligman wrote: On 4/13/12 12:38 PM, Seth Galitzer wrote: I'm working through this howto doc: http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf and am stuck at section 4.4. When I put the primary node in standby, it seems that NFS never releases the export, so it can't shut down, and thus can't get started on the secondary node. Everything up to that point in the doc works fine and fails over correctly. But once I add the exportfs resource, it fails. I'm running this on debian wheezy with the included standard packages, not custom. Any suggestions? I'd be happy to post configs and logs if requested. Yes, please post the output of crm configure show, the output of exportfs while the resource is running properly, and the relevant sections of your log file. I suggest using pastebin.com, to keep mailboxes filling up with walls of text. In case you haven't seen this thread already, you might want to take a look: http://www.gossamer-threads.com/lists/linuxha/dev/77166 And the resulting commit: https://github.com/ClusterLabs/resource-agents/commit/5b0bf96e77ed3c4e179c8b4c6a5ffd4709f8fdae (Links courtesy of Lars Ellenberg.) The problem and patch discussed in those links doesn't quite match what you describe. 
I mention it because I had to patch my exportfs resource (in /usr/lib/ocf/resource.d/heartbeat/exportfs on my RHEL systems) to get it to work properly in my setup. -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
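For reference, a rough sketch of an exportfs primitive with the rmtab_backup workaround mentioned above (directory, clientspec and resource name are taken from this thread; the fsid and options values are placeholders):

primitive exportfs-admin ocf:heartbeat:exportfs \
        params directory=/exports/admin clientspec=x.x.x.0/24 \
               options=rw fsid=1 rmtab_backup=none \
        op monitor interval=30s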
Re: [Linux-HA] Problem with stonith RA using external/ipmi over lan or lanplus interface
If I remember well, to use an iLO 3 card you should use the ipmilan cluster agent.

On 11 April 2012 23:00, Pham, Tom tom.p...@viasat.com wrote:

Hi everyone,

I am trying to test a two-node cluster with a stonith resource using external/ipmi (I tried external/riloe first but it does not seem to work). My system is an HP ProLiant BL460c G7 with an iLO 3 card, firmware 1.15, SUSE 11, corosync 1.2.7, Pacemaker 1.0.9.

When I use the lan or lanplus interface, it fails to start the stonith resource. I get the error below:

external/ipmi[12173]: [12184]: ERROR: error executing ipmitool: Error: Unable to establish IPMI v2 / RMCP+ session Unable to get Chassis Power Status

However, when I used interface=open instead of lan/lanplus, the stonith resource started fine. When I tried kill -9 on corosync on node1, I expected it to reboot node1 and start all resources on node2. But it rebooted node1. Someone mentioned that the open interface is a local interface and only allows a node to fence itself.

Does anyone know why lan/lanplus does not work?

Thanks
Tom Pham

--
esta es mi vida e me la vivo hasta que dios quiera
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
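Before changing agents it may be worth ruling out the IPMI-over-LAN setup itself. A minimal check with plain ipmitool, run from the peer node with the same interface the stonith plugin would use (address and credentials are placeholders):

        # IPMI v2.0 / RMCP+ session, i.e. what interface=lanplus uses
        ipmitool -I lanplus -H <ilo3-address> -U <user> -P <password> chassis power status
        # IPMI v1.5 session, i.e. what interface=lan uses
        ipmitool -I lan -H <ilo3-address> -U <user> -P <password> chassis power status

If these fail with the same RMCP+ error, the problem is on the iLO/network side rather than in the stonith plugin.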
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
William But i would like to know if you have a lvm resource in your pacemaker configuration Remember clvmd it's not for active di vg or lv it's for propagate the lvm meta data on all node of the cluster Il giorno 26 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/26/12 4:28 PM, emmanuel segura wrote: Sorry Willian i can't post my config now because i'm at home now not in my job I think it's no a problem if clvm start before drbd, because clvm not needed and devices to start This it's the point, i hope to be clear The introduction of pacemaker in redhat cluster was thinked for replace rgmanager not whole cluster stack and i suggest you to start clvmd at boot time chkconfig clvmd on I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get: Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED] ... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource. Sorry for my bad english :-) i can from a spanish country and all days i speak Italian I'm sorry that I don't speak more languages! You're the one who's helping me; it's my task to learn and understand. Certainly your English is better than my French or Russian. Il giorno 26 marzo 2012 22:04, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/26/12 3:48 PM, emmanuel segura wrote: I know it's normal fence_node doesn't work because the request of fence must be redirect to pacemaker stonith I think call the cluster agents with rgmanager it's really ugly thing, i never seen a cluster like this == If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. == clvm doesn't need to be called from rgmanger in the cluster configuration this the boot sequence of redhat daemons 1:cman, 2:clvm, 3:rgmanager and if you don't wanna use rgmanager you just replace rgmanager I'm sorry, but I don't think I understand what you're suggesting. Do you suggest that I start clvmd at boot? That won't work; clvmd won't see the volume groups on drbd until drbd is started and promoted to primary. May I ask you to post your own cluster.conf on pastebin.com so I can see how you do it? Along with crm configure show if that's relevant for your cluster? Il giorno 26 marzo 2012 19:21, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/24/12 5:40 PM, emmanuel segura wrote: I think it's better you use clvmd with cman I don't now why you use the lsb script of clvm On Redhat clvmd need of cman and you try to running with pacemaker, i not sure this is the problem but this type of configuration it's so strange I made it a virtual cluster with kvm and i not foud a problems While I appreciate the advice, it's not immediately clear that trying to eliminate pacemaker would do me any good. Perhaps someone can demonstrate the error in my reasoning: If I understand Pacemaker Explained http://bit.ly/GR5WEY and how I'd invoke clvmd from cman http://bit.ly/H6ZbKg, the clvmd script that would be invoked by either HA resource manager is exactly the same: /etc/init.d/clvmd. If I tried to use cman instead of pacemaker, I'd be cutting myself off from the pacemaker features that cman/rgmanager does not yet have available, such as pacemaker's symlink, exportfs, and clonable IPaddr2 resources. 
I recognize I've got a strange problem. Given that fence_node doesn't work but stonith_admin does, I strongly suspect that the problem is caused by the behavior of my fencing agent, not the use of pacemaker versus rgmanager, nor by how clvmd is being started. Il giorno 24 marzo 2012 13:09, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/24/12 4:47 AM, emmanuel segura wrote: How do you configure clvmd? with cman or with pacemaker? Pacemaker. Here's the output of 'crm configure show': http://pastebin.com/426CdVwN Il giorno 23 marzo 2012 22:14, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry but i would to know if can show me your /etc/cluster/cluster.conf Here it is: http://pastebin.com/GUr0CEgZ Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman
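On the vgscan point above: for reference, a rough sketch of making the volume group visible by hand (the VG name ADMIN is from this thread; with a clustered VG this assumes clvmd is already running, and normally pacemaker does the promote):

        # only once /dev/drbd0 is Primary on this node
        vgscan
        vgchange -ay ADMIN
        lvs ADMIN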
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes - SOLVED
William :-) So now your cluster it's OK? Il giorno 27 marzo 2012 00:33, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/26/12 5:31 PM, William Seligman wrote: On 3/26/12 5:17 PM, William Seligman wrote: On 3/26/12 4:28 PM, emmanuel segura wrote: and i suggest you to start clvmd at boot time chkconfig clvmd on I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get: Mounting GFS2 filesystem (/usr/nevis): invalid device path /dev/mapper/ADMIN-usr [FAILED] ... and so on, because the ADMIN volume group was never loaded by clvmd. Without a vgscan in there somewhere, the system can't see the volume groups on the drbd resource. Wait a second... there's an ocf:heartbeat:LVM resource! Testing... Emannuel, you did it! For the sake of future searches, and possibly future documentation, let me start with my original description of the problem: I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time-out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87 The problem is that clvmd on the main node will hang if there's a substantive period of time during which the other node returns running cman but not clvmd. I never tracked down why this happens, but there's a practical solution: minimize any interval for which that would be true. To ensure this, take clvmd outside the resource manager's control: chkconfig cman on chkconfig clvmd on chkconfig pacemaker on On RHEL6.2, these services will be started in the above order; clvmd will start within a few seconds after cman. Here's my cluster.conf http://pastebin.com/GUr0CEgZ and the output of crm configure show http://pastebin.com/f9D4Ui5Z. The key lines from the latter are: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin primitive AdminLvm ocf:heartbeat:LVM \ params volgrpname=ADMIN \ op monitor interval=30 timeout=100 depth=0 primitive Gfs2 lsb:gfs2 group VolumeGroup AdminLvm Gfs2 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 \ clone-max=2 clone-node-max=1 \ notify=true interleave=true clone VolumeClone VolumeGroup \ meta interleave=true colocation Volume_With_Admin inf: VolumeClone AdminClone:Master order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start What I learned: If one is going to extend the example in Clusters From Scratch to include logical volumes, one must start clvmd at boot time, and include any volume groups in ocf:heartbeat:LVM resources that start before gfs2. Note the long timeout on the ocf:heartbeat:LVM resource. 
This is a good idea because, during the boot of the crashed node, there'll still be an interval of a few seconds when cman will be running but clvmd won't be. During my tests, the LVM monitor would fail if it checked during that interval with a timeout that was shorter than it took clvmd to start on the crashed node. This was annoying; all resources dependent on AdminLvm would be stopped until AdminLvm recovered (a few more seconds). Increasing the timeout avoids this. It also means that during any recovery procedure on the crashed node for which I turn off all the services, I have to minimize the interval between the start of cman and clvmd if I've turned off services at boot; e.g., service drbd start # ... and fix any split-brain problems or whatever service cman start; service clvmd start # put on one line service pacemaker start I thank everyone on this list who was patient with me as I pounded on this problem for two weeks! -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman
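A quick way to double-check that boot ordering, assuming the standard RHEL 6 init scripts:

        chkconfig cman on
        chkconfig clvmd on
        chkconfig pacemaker on
        chkconfig --list | egrep 'cman|clvmd|pacemaker'

The first three lines are exactly what the post above recommends; the last just confirms that all three services are enabled for the current runlevels.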
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
How do you configure clvmd? with cman or with pacemaker? Il giorno 23 marzo 2012 22:14, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry but i would to know if can show me your /etc/cluster/cluster.conf Here it is: http://pastebin.com/GUr0CEgZ Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: On 03/15/2012 11:50 PM, William Seligman wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd? In lvm2.conf, I see this sequence of lines pre-crash: device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see: evice/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes ... and then it hangs. 
Comparing the two, it looks like it can't close /dev/drbd0. If I look at /proc/drbd when I crash one node, I see this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s- ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing was not successful . So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread
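For reference, the drbd.conf fragment that the fencing advice above corresponds to would look roughly like this (DRBD 8.3 syntax; the resource name is the one used later in this thread, and the handler path is an assumption that depends on where the script from http://goo.gl/O4N8f was installed):

resource admin {
    disk {
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/stonith_admin-fence-peer.sh";
    }
    ...
}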
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
I think it's better you use clvmd with cman I don't now why you use the lsb script of clvm On Redhat clvmd need of cman and you try to running with pacemaker, i not sure this is the problem but this type of configuration it's so strange I made it a virtual cluster with kvm and i not foud a problems Il giorno 24 marzo 2012 13:09, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/24/12 4:47 AM, emmanuel segura wrote: How do you configure clvmd? with cman or with pacemaker? Pacemaker. Here's the output of 'crm configure show': http://pastebin.com/426CdVwN Il giorno 23 marzo 2012 22:14, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/23/12 5:03 PM, emmanuel segura wrote: Sorry but i would to know if can show me your /etc/cluster/cluster.conf Here it is: http://pastebin.com/GUr0CEgZ Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing was not successful . So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality with stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504 At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent: http://www.gossamer-threads.com/lists/linuxha/users/78572 I cleaned up my fencing agent, making sure its return code matched those returned by other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status. But... After that, I'll look at another suggestion with lvm.conf: http://www.gossamer-threads.com/lists/linuxha/users/78796#78796 Then I'll try DRBD 8.4.1. Hopefully one of these is the source of the issue. Failure on all three counts. May I suggest you double check the permissions on your fence peer script? I suspect you may simply have forgotten the chmod +x . Test with drbdadm fence-peer minor-0 from the command line. I still haven't solved the problem, but this advice has gotten me further than before. First, Lars was correct: I did not have execute permissions set on my fence peer scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd still hung on the vgdisplay command if I crashed the peer node. I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried Lars' suggested command. I didn't save the response for this message (d'oh again!) but it said that the fence-peer script had failed. Hmm. The peer was definitely shutting down, so my fencing script is working. I went over it, comparing the return codes to those of the existing scripts, and made some changes. Here's my current script: http://pastebin.com/nUnYVcBK. 
Up until now my fence-peer scripts had either been Lon Hohberger's obliterate-peer.sh or Digimer's rhcs_fence. I decided to try stonith_admin-fence-peer.sh that Andreas Kurz recommended; unlike the first two scripts, which fence using fence_node, the latter script just calls stonith_admin. When I tried the stonith_admin-fence-peer.sh script, it worked: # drbdadm fence-peer minor-0 stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced peer orestes-corosync.nevis.columbia.edu. Power was cut on the peer, the remaining node stayed up. Then I brought up the peer with: stonith_admin -U orestes-corosync.nevis.columbia.edu BUT: When the restored peer came up and started to run cman, the clvmd hung on the main node again. After cycling through some more tests, I found that if I brought down the peer with drbdadm, then brought up with the peer with no HA services, then started drbd and then cman, the cluster remained intact. If I crashed the peer, the scheme in the previous paragraph didn't work. I bring up drbd, check that the disks are both UpToDate, then bring up cman. At that point the vgdisplay on the main node takes so long to run
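Two small checks that follow from the advice above (the script path is a placeholder for wherever the fence-peer handler actually lives):

        # the handler must be executable on both nodes
        chmod +x <path-to-stonith_admin-fence-peer.sh>
        # then exercise it through drbd itself, as suggested earlier in the thread
        drbdadm fence-peer minor-0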
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William Sorry but i would to know if can show me your /etc/cluster/cluster.conf Il giorno 23 marzo 2012 21:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/22/12 2:43 PM, William Seligman wrote: On 3/20/12 4:55 PM, Lars Ellenberg wrote: On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote: On 3/16/12 12:12 PM, William Seligman wrote: On 3/16/12 7:02 AM, Andreas Kurz wrote: On 03/15/2012 11:50 PM, William Seligman wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd? In lvm2.conf, I see this sequence of lines pre-crash: device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see: evice/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes ... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. 
If I look at /proc/drbd when I crash one node, I see this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s- ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. Question is, why the peer does _not_ also suspend its I/O because obviously fencing was not successful . So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f This is the output of drbdadm dump admin: http://pastebin.com/kTxvHCtx So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality with stonith_admin_fence_peer.sh: http://www.gossamer-threads.com/lists/linuxha/users/78504#78504 At the moment I'm pursuing the possibility that I'm returning the wrong
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William for the lvm hang you can use this in your /etc/lvm/lvm.conf ignore_suspended_devices = 1 because i seen in the lvm log, === and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0 === Il giorno 15 marzo 2012 23:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd? In lvm2.conf, I see this sequence of lines pre-crash: device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see: evice/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes ... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. 
If I look at /proc/drbd when I crash one node, I see this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s- ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 If I look at /proc/drbd if I bring down one node gracefully (crm node standby), I get this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r- ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 Could it be that drbd can't respond to certain requests from lvm if the state of the peer is DUnknown instead of Outdated? Il giorno 15 marzo 2012 20:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:55 PM, emmanuel segura wrote: I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether
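For reference, the lvm.conf change suggested above goes in the devices section; a minimal sketch:

devices {
    ...
    # skip devices whose I/O is suspended when scanning
    ignore_suspended_devices = 1
}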
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello Willian The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html. Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference? Il giorno 13 marzo 2012 23:29, William Seligmanseligman@nevis.** columbia.edu selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed? 
Il giorno 13 marzo 2012 22:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time-out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6
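A quick way to confirm that cluster locking is actually enabled, using only commands that appear later in this thread:

        lvmconf --enable-cluster
        lvm dumpconfig | egrep 'locking_type|fallback_to_local_locking'
        # expected with clvmd: locking_type=3 and fallback_to_local_locking=0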
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed: # cd /etc/lvm/ # cp lvm.conf lvm.conf.cluster # lvmconf --enable-cluster # diff lvm.conf lvm.conf.cluster # So the key lines have been there all along: locking_type = 3 fallback_to_local_locking = 0 Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html. Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference? Il giorno 13 marzo 2012 23:29, William Seligmanseligman@nevis.** columbia.edu selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. 
Going by these instructions: https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed? Il giorno 13 marzo 2012 22:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William Ho did you created your volume group give me the output of vgs command when the cluster it's up Il giorno 15 marzo 2012 17:06, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 11:50 AM, emmanuel segura wrote: yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Since you asked for confirmation, here it is: the output of 'clvmd -d' just now. http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now. Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed: # cd /etc/lvm/ # cp lvm.conf lvm.conf.cluster # lvmconf --enable-cluster # diff lvm.conf lvm.conf.cluster # So the key lines have been there all along: locking_type = 3 fallback_to_local_locking = 0 Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html . Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. 
I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make a difference? Il giorno 13 marzo 2012 23:29, William Seligmanseligman@nevis.** columbia.edu selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_**Red_Hat_KVM_Cluster_Tutorial https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Il giorno 15 marzo 2012 17:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:15 PM, emmanuel segura wrote: Ho did you created your volume group pvcreate /dev/drbd0 vgcreate -c y ADMIN /dev/drbd0 lvcreate -L 200G -n usr ADMIN # ... and so on # Nevis-HA is the cluster name I used in cluster.conf mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr # ... and so on give me the output of vgs command when the cluster it's up Here it is: Logging initialised at Thu Mar 15 12:40:39 2012 Set umask from 0022 to 0077 Finding all volume groups Finding volume group ROOT Finding volume group ADMIN VG#PV #LV #SN Attr VSize VFree ADMIN 1 5 0 wz--nc 2.61t 765.79g ROOT1 2 0 wz--n- 117.16g 0 Wiping internal VG cache I assume the c in the ADMIN attributes means that clustering is turned on? Il giorno 15 marzo 2012 17:06, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 11:50 AM, emmanuel segura wrote: yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Since you asked for confirmation, here it is: the output of 'clvmd -d' just now. http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now. Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. Then I did as you suggested, but with a check to see if anything changed: # cd /etc/lvm/ # cp lvm.conf lvm.conf.cluster # lvmconf --enable-cluster # diff lvm.conf lvm.conf.cluster # So the key lines have been there all along: locking_type = 3 fallback_to_local_locking = 0 Il giorno 14 marzo 2012 23:17, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William i did new you are using drbd and i dont't know what type of configuration you using But it's better you try to start clvm with clvmd -d like thak we can see what it's the problem For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. 
Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html . Here's one with the same problem with the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them? Il giorno 14 marzo 2012 14:02, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/14/12 6:02 AM, emmanuel segura wrote: I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on I've already tried it. It doesn't work. The problem is that my LVM information is on the drbd. If I start up clvmd before drbd, it won't find the logical volumes. I also don't see why that would make a difference (although this could be part of the confusion): a service is a service. I've tried starting up clvmd inside and outside pacemaker control, with the same problem. Why would starting clvmd at boot make
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Ok William we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Il giorno 15 marzo 2012 20:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:55 PM, emmanuel segura wrote: I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show like that more later i can try to look if i found some fix Thanks for taking a look. My cluster.conf: http://pastebin.com/w5XNYyAX crm configure show: http://pastebin.com/atVkXjkn Before you spend a lot of time on the second file, remember that clvmd will hang whether or not I'm running pacemaker. Il giorno 15 marzo 2012 17:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:15 PM, emmanuel segura wrote: Ho did you created your volume group pvcreate /dev/drbd0 vgcreate -c y ADMIN /dev/drbd0 lvcreate -L 200G -n usr ADMIN # ... and so on # Nevis-HA is the cluster name I used in cluster.conf mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr # ... and so on give me the output of vgs command when the cluster it's up Here it is: Logging initialised at Thu Mar 15 12:40:39 2012 Set umask from 0022 to 0077 Finding all volume groups Finding volume group ROOT Finding volume group ADMIN VG#PV #LV #SN Attr VSize VFree ADMIN 1 5 0 wz--nc 2.61t 765.79g ROOT1 2 0 wz--n- 117.16g 0 Wiping internal VG cache I assume the c in the ADMIN attributes means that clustering is turned on? Il giorno 15 marzo 2012 17:06, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 11:50 AM, emmanuel segura wrote: yes william Now try clvmd -d and see what happen locking_type = 3 it's lvm cluster lock type Since you asked for confirmation, here it is: the output of 'clvmd -d' just now. http://pastebin.com/bne8piEw. I crashed the other node at Mar 15 12:02:35, when you see the only additional line of output. I don't see any particular difference between this and the previous result http://pastebin.com/sWjaxAEF, which suggests that I had cluster locking enabled before, and still do now. Il giorno 15 marzo 2012 16:15, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 5:18 AM, emmanuel segura wrote: The first thing i seen in your clvmd log it's this = WARNING: Locking disabled. Be careful! This could corrupt your metadata. = I saw that too, and thought the same as you did. I did some checks (see below), but some web searches suggest that this message is a normal consequence of clvmd initialization; e.g., http://markmail.org/message/vmy53pcv52wu7ghx use this command lvmconf --enable-cluster and remember for cman+pacemaker you don't need qdisk Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf: http://pastebin.com/841VZRzW and the output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf. 
Then I did as you suggested, but with a check to see if anything changed:
# cd /etc/lvm/
# cp lvm.conf lvm.conf.cluster
# lvmconf --enable-cluster
# diff lvm.conf lvm.conf.cluster
#
So the key lines have been there all along: locking_type = 3 and fallback_to_local_locking = 0.
On 14 March 2012 23:17, William Seligman selig...@nevis.columbia.edu wrote: On 3/14/12 9:20 AM, emmanuel segura wrote: Hello William, I didn't know you were using drbd, and I don't know what type of configuration you're using. But it would be better to start clvmd with clvmd -d, so that we can see what the problem is. For what it's worth, here's the output of running clvmd -d on the node that stays up: http://pastebin.com/sWjaxAEF What's probably important in that big mass of output are the last two lines. Up to that point, I have both nodes up and running cman + clvmd; cluster.conf is here: http://pastebin.com/w5XNYyAX At the time of the next-to-the-last line, I cut power to the other node. At the time of the last line, I run vgdisplay on the remaining node, which hangs forever. After a lot of web searching, I found that I'm not the only one with this problem. Here's one case that doesn't seem relevant to me, since I don't use qdisk: http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html Here's one with the same problem on the same OS: http://bugs.centos.org/view.php?id=5229, but with no resolution. Out of curiosity, has anyone on this list made a two-node cman+clvmd cluster work for them?
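For reference, Emmanuel's debug-logging suggestion at the top of this exchange corresponds to something like the following in /etc/lvm/lvm.conf (a sketch against a stock lvm.conf; the other defaults in the log section vary by distribution, and clvmd only rereads the file when it is restarted, e.g. by running clvmd -d again):

log {
    verbose = 0
    syslog = 1
    # write a copy of all messages to a file and raise the debug level
    file = "/var/log/lvm2.log"
    level = 7
}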
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William I think it's better you make clvmd start at boot chkconfig cman on ; chkconfig clvmd on Il giorno 13 marzo 2012 23:29, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 5:50 PM, emmanuel segura wrote: So if you using cman why you use lsb::clvmd I think you are very confused I don't dispute that I may be very confused! However, from what I can tell, I still need to run clvmd even if I'm running cman (I'm not using rgmanager). If I just run cman, gfs2 and any other form of mount fails. If I run cman, then clvmd, then gfs2, everything behaves normally. Going by these instructions: https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial the resources he puts under cluster control (rgmanager) I have to put under pacemaker control. Those include drbd, clvmd, and gfs2. The difference between what I've got, and what's in Clusters From Scratch, is in CFS they assign one DRBD volume to a single filesystem. I create an LVM physical volume on my DRBD resource, as in the above tutorial, and so I have to start clvmd or the logical volumes in the DRBD partition won't be recognized. Is there some way to get logical volumes recognized automatically by cman without rgmanager that I've missed? Il giorno 13 marzo 2012 22:42, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/13/12 12:29 PM, William Seligman wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: The monitor/status command includes vgdisplay, which hangs indefinitely. Therefore the monitor will always time-out. So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32) cman-3.0.12.1 corosync-1.4.1 pacemaker-1.1.6 lvm2-2.02.87 lvm2-cluster-2.02.87 This may be a Linux-HA question after all! I ran a few more tests. Here's the output from a typical test of grep -E (dlm|gfs2}clvmd|fenc|syslogd) /var/log/messages http://pastebin.com/uqC6bc1b It looks like what's happening is that the fence agent (one I wrote) is not returning the proper error code when a node crashes. According to this page, if a fencing agent fails GFS2 will freeze to protect the data: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html As a test, I tried to fence my test node via standard means: stonith_admin -F orestes-corosync.nevis.columbia.edu These were the log messages, which show that stonith_admin did its job and CMAN was notified of the fencing: http://pastebin.com/jaH820Bv. Unfortunately, I still got the gfs2 freeze, so this is not the complete story. First things first. I vaguely recall a web page that went over the STONITH return codes, but I can't locate it again. 
Is there any reference to the return codes expected from a fencing agent, perhaps as a function of the state of the fencing device?
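For what it's worth, the contract the rest of the stack relies on is simply the agent's exit status: exit 0 only after the requested action has been verified (for an "off", only once the node is confirmed to be without power), and non-zero on any failure; an unfenced failure is exactly what leaves GFS2/DLM blocked. A minimal sketch of that contract follows; the helper functions are hypothetical placeholders, and a real agent on the cman stack also has to parse the option=value pairs that fenced passes on stdin:

#!/bin/bash
# Skeleton fence agent: only the exit-status contract is illustrated here.
action="$1"     # real agents read "option=value" lines from stdin instead
case "$action" in
  off|reboot)
    cut_power_via_ups   || exit 1   # hypothetical helper: drop the UPS outlet
    confirm_node_is_off || exit 1   # never report success before the node is verified dead
    exit 0
    ;;
  on)
    restore_power_via_ups && exit 0 || exit 1
    ;;
  status|monitor)
    ups_is_reachable && exit 0 || exit 1
    ;;
  *)
    exit 1
    ;;
esac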
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William, I didn't know you were using drbd, and I don't know what type of configuration you're using. But it would be better to start clvmd with clvmd -d, so that we can see what the problem is.
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Sorry William, but I think clvmd should be used with ocf:lvm2:clvmd. Example:
crm configure primitive clvmd ocf:lvm2:clvmd params daemon_timeout=30
clone cln_clvmd clvmd
And remember that clvmd depends on dlm, so you should do the same for the dlm.
On 13 March 2012 17:29, William Seligman selig...@nevis.columbia.edu wrote: I'm not sure if this is a Linux-HA question; please direct me to the appropriate list if it's not. I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in Clusters From Scratch. Fencing is through forcibly rebooting a node by cutting and restoring its power via UPS. My fencing/failover tests have revealed a problem. If I gracefully turn off one node (crm node standby; service pacemaker stop; shutdown -r now) all the resources transfer to the other node with no problems. If I cut power to one node (as would happen if it were fenced), the lsb::clvmd resource on the remaining node eventually fails. Since all the other resources depend on clvmd, all the resources on the remaining node stop and the cluster is left with nothing running. I've traced why the lsb::clvmd fails: the monitor/status command includes vgdisplay, which hangs indefinitely, so the monitor will always time out. So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is cut off, the cluster isn't handling it properly. Has anyone on this list seen this before? Any ideas? Details: versions: Redhat Linux 6.2 (kernel 2.6.32), cman-3.0.12.1, corosync-1.4.1, pacemaker-1.1.6, lvm2-2.02.87, lvm2-cluster-2.02.87. cluster.conf: http://pastebin.com/w5XNYyAX output of crm configure show: http://pastebin.com/atVkXjkn output of lvm dumpconfig: http://pastebin.com/rtw8c3Pf /var/log/cluster/dlm_controld.log and /var/log/cluster/gfs_controld.log show nothing. When I shut down power to one node (orestes-tb), the output of grep -E (dlm|gfs2|clvmd) /var/log/messages is http://pastebin.com/vjpvCFeN.
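Spelling that suggestion out a little: the ocf:lvm2:clvmd agent ships with the SUSE lvm2 packaging and may not exist on a Red Hat 6 install (where the LSB init script is what's available), but where it is present the usual pattern is a DLM control daemon clone plus a clvmd clone, roughly like this in crm configure (names and timeouts are illustrative):

primitive dlm ocf:pacemaker:controld \
        op monitor interval="60s" timeout="60s"
primitive clvmd ocf:lvm2:clvmd \
        params daemon_timeout="30" \
        op monitor interval="60s" timeout="60s"
clone dlm-clone dlm meta interleave="true"
clone clvmd-clone clvmd meta interleave="true"
colocation clvmd-with-dlm inf: clvmd-clone dlm-clone
order dlm-before-clvmd inf: dlm-clone clvmd-clone

On a cman-based stack dlm_controld is started by cman itself, so the controld resource above would not apply there; the point is only that clvmd has to come up after the DLM, whichever component starts it.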
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
Hello William, so if you're using cman, why do you use lsb::clvmd? I think you are very confused.
Re: [Linux-HA] Apparent problem in pacemaker ordering
are you sure the exportfs agent can be use it with clone active/active? Il giorno 03 marzo 2012 00:12, William Seligman selig...@nevis.columbia.edu ha scritto: One step forward, two steps back. I'm working on a two-node primary-primary cluster. I'm debugging problems I have with the ocf:heartbeat:exportfs resource. For some reason, pacemaker sometimes appears to ignore ordering I put on the resources. Florian Haas recommended pastebin in another thread, so let's give it a try. Here's my complete current output of crm configure show: http://pastebin.com/bbSsqyeu Here's a quick sketch: The sequence of events is supposed to be DRBD (ms) - clvmd (clone) - gfs2 (clone) - exportfs (clone). But that's not what happens. What happens is that pacemaker tries to start up the exportfs resource immediately. This fails, because what it's exporting doesn't exist until after gfs2 runs. Because the cloned resource can't run on either node, the cluster goes into a state in which one node is fenced, the other node refuses to run anything. Here's a quick snapshot I was able to take of the output of crm_mon that shows the problem: http://pastebin.com/CiZvS4Fh This shows that pacemaker is still trying to start the exportfs resources, before it has run the chain drbd-clvmd-gfs2. Just to confirm the obvious, I have the ordering constraints in the full configuration linked above (Admin is my DRBD resource): order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone This is not the only time I've observed this behavior in pacemaker. Here's a lengthy log file excerpt from the same time I took the crm_mon snapshot: http://pastebin.com/HwMUCmcX I can see that other resources, the symlink ones in particular, are being probed and started before the drbd Admin resource has a chance to be promoted. In looking at the log file, it may help to know that /mail and /var/nevis are gfs2 partitions that aren't mounted until the Gfs2 resource starts. So this isn't the first time I've seen this happen. This is just the first time I've been able to reproduce this reliably and capture a snapshot. Any ideas? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
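Two things may be worth checking here, without claiming either is the actual bug: what shows up immediately after startup is often just the initial probe (the monitor with interval 0), which Pacemaker runs on every resource regardless of ordering constraints, and clones in a dependency chain usually want interleave=true so that an instance only waits for the copy on its own node rather than for the whole clone. A sketch using the resource names from the pasted configurations:

clone ClvmdClone Clvmd meta interleave="true"
clone Gfs2Clone Gfs2 meta interleave="true"
clone ExportsClone ExportMail meta interleave="true"
order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start
order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone
order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone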
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
can you show me your /etc/cluster/cluster.conf? because i think your problem it's a fencing-loop Il giorno 01 marzo 2012 01:03, William Seligman selig...@nevis.columbia.edu ha scritto: On 2/28/12 7:26 PM, Lars Ellenberg wrote: On Tue, Feb 28, 2012 at 03:51:29PM -0500, William Seligman wrote: off-topic Sigh. I wish that were the reason. The reason why I'm doing dual-primary is that I've a got a single-primary two-node cluster in production that simply doesn't work. One node runs resources; the other sits and twiddles its fingers; fine. But when primary goes down, secondary has trouble starting up all the resources; when we've actually had primary failures (UPS goes haywire, hard drive failure) the secondary often winds up in a state in which it runs none of the significant resources. With the dual-primary setup I have now, both machines are running the resources that typically cause problems in my single-primary configuration. If one box goes down, the other doesn't have to failover anything; it's already running them. (I needed IPaddr2 cloning to work properly for this to work, which is why I started that thread... and all the stupider of me for missing that crucial page in Clusters From Scratch.) My only remaining problem with the configuration is restoring a fenced node to the cluster. Hence my tests, and the reason why I started this thread. /off-topic Uhm, I do think that is exactly on topic. Rather fix your resources to be able to successfully take over, than add even more complexity. What resources would that be, and why are they not taking over? I can't tell you in detail, because the major snafu happened on a production system after a power outage a few months ago. My goal was to get the thing stable as quickly as possible. In the end, that turned out to be a non-HA configuration: One runs corosync+pacemaker+drbd, while the other just runs drbd. It works, in the sense that the users get their e-mail. If there's a power outage, I have to bring things up manually. So my only reference is the test-bench dual-primary setup I've got now, which is exhibiting the same kinds of problems even though the OS versions, software versions, and layout are different. This suggests that the problem lies in the way I'm setting up the configuration. The problems I have seem to be in the general category of the 'good guy' gets fenced when the 'bad guy' gets into trouble. Examples: - Assuming I start out with two crashed nodes. If I just start up DRBD and nothing else, the partitions sync quickly with no problems. - If the system starts with cman running, and I start drbd, it's likely that system who is _not_ Outdated will be fenced (rebooted). Same thing if cman+pacemaker is running. - Cloned ocf:heartbeat:exportfs resources are giving me problems as well (which is why I tried making changes to that resource script). Assume I start with one node running cman+pacemaker, and the other stopped. I turned on the stopped node. This will typically result in the running node being fenced, because it has it times out when stopping the exportfs resource. Falling back to DRBD 8.3.12 didn't change this behavior. My pacemaker configuration is long, so I'll excerpt what I think are the relevant pieces in the hope that it will be enough for someone to say You fool! This is covered in Pacemaker Explained page 56! When bringing up a stopped node, in order to restart AdminClone pacemaker wants to stop ExportsClone, then Gfs2Clone, then ClvmdClone. 
As I said, it's the failure to stop ExportMail on the running node that causes it to be fenced. primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op monitor interval=59s role=Slave \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 \ clone-max=2 clone-node-max=1 notify=true primitive Clvmd lsb:clvmd op monitor interval=30s clone ClvmdClone Clvmd colocation Clvmd_With_Admin inf: ClvmdClone AdminClone:Master order Admin_Before_Clvmd inf: AdminClone:promote ClvmdClone:start primitive Gfs2 lsb:gfs2 op monitor interval=30s clone Gfs2Clone Gfs2 colocation Gfs2_With_Clvmd inf: Gfs2Clone ClvmdClone order Clvmd_Before_Gfs2 inf: ClvmdClone Gfs2Clone primitive ExportMail ocf:heartbeat:exportfs \ op start interval=0 timeout=40 \ op stop interval=0 timeout=45 \ params clientspec=mail directory=/mail fsid=30 clone ExportsClone ExportMail colocation Exports_With_Gfs2 inf: ExportsClone Gfs2Clone order Gfs2_Before_Exports inf: Gfs2Clone ExportsClone -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA|
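Since it is the timed-out stop of ExportMail that escalates to fencing (a failed stop is fenced by design when STONITH is enabled), one small experiment is a more generous stop timeout on the exportfs primitive; purely a knob to try, not a claim about the root cause:

primitive ExportMail ocf:heartbeat:exportfs \
        params clientspec="mail" directory="/mail" fsid="30" \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="120"

If the stop genuinely never completes (for example because it is blocked behind a frozen GFS2 mount), a longer timeout only delays the fence rather than preventing it.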
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
Try changing the fence_daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/>. Change your cluster config version and then reboot the cluster.
On 1 March 2012 12:28, William Seligman selig...@nevis.columbia.edu wrote: On 3/1/12 4:15 AM, emmanuel segura wrote: Can you show me your /etc/cluster/cluster.conf? Because I think your problem is a fencing loop. Here it is: /etc/cluster/cluster.conf:
<?xml version="1.0"?>
<cluster config_version="17" name="Nevis_HA">
  <logging debug="off"/>
  <cman expected_votes="1" two_node="1"/>
  <clusternodes>
    <clusternode name="hypatia-tb.nevis.columbia.edu" nodeid="1">
      <altname name="hypatia-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="hypatia-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="orestes-tb.nevis.columbia.edu" nodeid="2">
      <altname name="orestes-private.nevis.columbia.edu" port="5405" mcast="226.94.1.1"/>
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="orestes-tb.nevis.columbia.edu"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
  <fence_daemon post_join_delay="30"/>
  <rm disabled="1"/>
</cluster>
Re: [Linux-HA] cman+pacemaker+drbd fencing problem
OK William, if this isn't the problem, then show me your pacemaker CIB XML: the output of crm configure show.
On 1 March 2012 18:10, William Seligman selig...@nevis.columbia.edu wrote: On 3/1/12 6:34 AM, emmanuel segura wrote: Try changing the fence_daemon tag like this: <fence_daemon clean_start="1" post_join_delay="30"/>. Change your cluster config version and then reboot the cluster. This did not change the behavior of the cluster. In particular, I'm still dealing with this: if the system starts with cman running and I start drbd, it's likely that the system which is _not_ Outdated will be fenced (rebooted).
Re: [Linux-HA] Pacemaker - Resources not staying together
colocation altogether inf: apache mysql drbd_fs drbd_ms:Master 2012/2/10 Ryan Stepalavich rstepalav...@gmail.com I'm using Pacemaker to handle my cluster resources (on top of heartbeat). Everything works except the collocation parameter. I want all of my resources to stay on the same node at all times. Here's what my cluster looks like right now: --- Last updated: Fri Feb 10 16:52:10 2012 Stack: Heartbeat Current DC: svrmntr01 (715d1b92-3849-4dab-8d4a-a3b3a4f4efc3) - partition with quorum Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f 2 Nodes configured, unknown expected votes 4 Resources configured. Online: [ svrmntr01 svrmntr02 ] Master/Slave Set: drbd_ms [drbd] Masters: [ svrmntr01 ] Slaves: [ svrmntr02 ] drbd_fs (ocf::heartbeat:Filesystem):Started svrmntr01 apache (lsb:apache2): Started svrmntr02 (unmanaged) FAILED mysql (lsb:mysql) Started [ svrmntr01 svrmntr02 ] Failed actions: apache_monitor_0 (node=svrmntr02, call=4, rc=127, status=complete): unknown apache_stop_0 (node=svrmntr02, call=6, rc=127, status=complete): unknown - Here's my current configuration: - node $id=715d1b92-3849-4dab-8d4a-a3b3a4f4efc3 svrmntr01 node $id=af6fe9bc-b89d-4460-9b50-3039bbd9e144 svrmntr02 \ attributes standby=off primitive apache lsb:apache2 \ meta target-role=Started primitive drbd ocf:linbit:drbd \ params drbd_resource=lamp \ op monitor interval=60s primitive drbd_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/srv/data fstype=ext4 primitive mysql lsb:mysql ms drbd_ms drbd \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true location cli-standby-apache apache \ rule $id=cli-standby-rule-apache -inf: #uname eq svrmntr02 colocation altogether inf: drbd_fs drbd_ms:Master apache mysql order fs_after_drbd_then_lamp inf: drbd_ms:promote drbd_fs:start mysql:start apache:start property $id=cib-bootstrap-options \ dc-version=1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \ cluster-infrastructure=Heartbeat \ stonith-enabled=false Anybody have an idea as to what I'm doing wrong? Thanks! ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
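An alternative to a long colocation set that is often easier to reason about is a group: ordering inside the group is implicit, and a single colocation/order ties the whole group to the DRBD master. A sketch reusing the resource names above, not a drop-in replacement for the posted configuration; the leftover cli-standby-apache location constraint may also be worth removing, since it forbids apache from running on svrmntr02 at all:

group lamp drbd_fs mysql apache
colocation lamp_on_master inf: lamp drbd_ms:Master
order lamp_after_drbd inf: drbd_ms:promote lamp:start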
Re: [Linux-HA] Antw: Status about ocfs2.pcmk ?
Sorry But can we see your configuration? 2012/2/2 alain.mou...@bull.net Hi Don't remember in details, it was at the end of 2010 ... but : Why pacemaker is fencing a node ? because it was one of my simple HA test : for example, make the heartbeat no more working so that pacemaker fences the node, and it did it for sure, but on the remaining node, sometimes ocfs2 crashes the node, or the FS ocfs2 was passing in read-only ... I did simple HA tests, such as this one above, to check the robustness of the configuration and many times, it failed and make the HA cluster down. That's why I ask for status today. Alain De :Ulrich Windl ulrich.wi...@rz.uni-regensburg.de A : General Linux-HA mailing list linux-ha@lists.linux-ha.org Date : 02/02/2012 16:01 Objet : Re: [Linux-HA] Antw: Status about ocfs2.pcmk ? Envoyé par :linux-ha-boun...@lists.linux-ha.org alain.mou...@bull.net schrieb am 02.02.2012 um 16:05 in Nachricht of5716529b.0ca482e6-onc1257998.0050e240-c1257998.00518...@bull.net: Hi Thanks. OK but I also could mount the FS , the problems in 2010 on RHEL was that the configuration was not robust , meaning that during validation tests, there were often cases where both nodes were dead : one fenced by Pacemaker and the other killed itself by ocfs2, or problems on mount read-only FS after failover , etc. Hi! I'd start inspecting the logs: Why is pacemaker fencing a node? Why is the filesytem read-only? Regards, Ulrich Alain De :Ulrich Windl ulrich.wi...@rz.uni-regensburg.de A : linux-ha@lists.linux-ha.org Date : 02/02/2012 15:33 Objet : [Linux-HA] Antw: Status about ocfs2.pcmk ? Envoyé par :linux-ha-boun...@lists.linux-ha.org Hi! I have something running using OCFS on SLES11 SP1: ocf:pacemaker:controld ocf:ocfs2:o2cb At least I could mount the filesystem with it: /dev/drbd_r0 on /exports/ocfs/samba type ocfs2 (rw,_netdev,acl,cluster_stack=pcmk) Regards, Ulrich alain.mou...@bull.net schrieb am 02.02.2012 um 14:54 in Nachricht offed8db74.9970c9e4-onc1257998.004a64db-c1257998.004b1...@bull.net: Hi Just wonder if someone has succeded to configured a working HA configuration with Pacemaker/corosync and OCFS2 file systems, meaning using ocfs2.pcmk , on RHEL6 mainly (and eventually SLES11) ? (I tried at the end of 2010 but gave up after a few weeks because it was not working at all) Thanks if someone can give a status? Regards Alain Moullé ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
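For comparison, the controld/o2cb pairing Ulrich mentions is normally expressed as a cloned group, along the lines of the SLES 11 HA documentation, with the OCFS2 Filesystem resource then cloned on top of it. A sketch only; names and timeouts are illustrative:

primitive dlm ocf:pacemaker:controld \
        op monitor interval="60s" timeout="60s"
primitive o2cb ocf:ocfs2:o2cb \
        op monitor interval="60s" timeout="60s"
group base-group dlm o2cb
clone base-clone base-group meta interleave="true"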
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
William try to follow the suggestion of Arnold In my case it's different because we don't use drbd we are using SAN with ocfs2 But i think for drbd in dual primary you need the attribute master-max=2 2012/1/31 William Seligman selig...@nevis.columbia.edu On Tue, 31 Jan 2012 00:36:23 Arnold Krille wrote: On Tuesday 31 January 2012 00:12:52 emmanuel segura wrote: But if you wanna implement dual primary i think you don't nee promote for your drbd Try to use clone without master/slave At least when you use the linbit-ra, using it without a master-clone will give you one(!) slave only. When you use a normal clone with two clones, you will get two slaves. The RA only goes primary on promote, that is when its in master-state. = You need a master-clone of two clones with 1-2 masters to use drbd in the cluster. If I understand Emmanual's suggestion: The only way I know how to implement this is to create a simple clone group with lsb::drbd instead of Linbit's drbd resource, and put become-primary-on for both my nodes in drbd.conf. This might work in the short term, but I think it's risky in the long term. For example: Something goes wrong and node A stoniths node B. I bring node B back up, disabling cman+pacemaker before I do so, and want to re-sync node B's DRBD partition with A. If I'm stupid (occupational hazard), I won't remember to edit drbd.conf before I do this, node B will automatically try to become primary, and probably get stonith'ed again. Arnold: I thought that was what I was doing with these statements: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 That is, master-max=2 means to promote two instances to master. Did I get it wrong? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
William can you try like this primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master clone Adming AdminDrbd 2012/1/31 William Seligman selig...@nevis.columbia.edu On 1/31/12 3:47 PM, emmanuel segura wrote: William try to follow the suggestion of Arnold In my case it's different because we don't use drbd we are using SAN with ocfs2 But i think for drbd in dual primary you need the attribute master-max=2 I did, or thought I did. Have I missed something? Again, from crm configure show: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op monitor interval=59s role=Slave \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 Still no promotion to primary on either node. 2012/1/31 William Seligman selig...@nevis.columbia.edu On Tue, 31 Jan 2012 00:36:23 Arnold Krille wrote: On Tuesday 31 January 2012 00:12:52 emmanuel segura wrote: But if you wanna implement dual primary i think you don't nee promote for your drbd Try to use clone without master/slave At least when you use the linbit-ra, using it without a master-clone will give you one(!) slave only. When you use a normal clone with two clones, you will get two slaves. The RA only goes primary on promote, that is when its in master-state. = You need a master-clone of two clones with 1-2 masters to use drbd in the cluster. If I understand Emmanual's suggestion: The only way I know how to implement this is to create a simple clone group with lsb::drbd instead of Linbit's drbd resource, and put become-primary-on for both my nodes in drbd.conf. This might work in the short term, but I think it's risky in the long term. For example: Something goes wrong and node A stoniths node B. I bring node B back up, disabling cman+pacemaker before I do so, and want to re-sync node B's DRBD partition with A. If I'm stupid (occupational hazard), I won't remember to edit drbd.conf before I do this, node B will automatically try to become primary, and probably get stonith'ed again. Arnold: I thought that was what I was doing with these statements: primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 That is, master-max=2 means to promote two instances to master. Did I get it wrong? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] cman+pacemaker+dual-primary drbd does not promote
Sorry William But if you wanna implement dual primary i think you don't nee promote for your drbd Try to use clone without master/slave 2012/1/30 William Seligman selig...@nevis.columbia.edu I'm trying to follow the directions for setting up a dual-primary DRBD setup with CMAN and Pacemaker. I'm stuck at an annoying spot: Pacemaker won't promote the DRBD resources to primary at either node. Here's the result of crm_mon: Last updated: Mon Jan 30 17:07:03 2012 Stack: cman Current DC: hypatia-tb - partition with quorum Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f 2 Nodes configured, unknown expected votes 2 Resources configured. Online: [ orestes-tb hypatia-tb ] Master/Slave Set: AdminClone [AdminDrbd] Slaves: [ hypatia-tb orestes-tb ] /etc/cluster/cluster.conf: cluster config_version=6 name=Nevis_HA logging debug=off/ cman expected_votes=1 two_node=1 / clusternodes clusternode name=hypatia-tb nodeid=1 fence method name=pcmk-redirect device name=pcmk port=hypatia-tb/ /method /fence /clusternode clusternode name=orestes-tb nodeid=2 fence method name=pcmk-redirect device name=pcmk port=orestes-tb/ /method /fence /clusternode /clusternodes fencedevices fencedevice name=pcmk agent=fence_pcmk/ /fencedevices !-- fence_daemon post_join_delay=30 / -- /cluster crm configure show: node hypatia-tb node orestes-tb primitive AdminDrbd ocf:linbit:drbd \ params drbd_resource=admin \ op monitor interval=60s role=Master \ op stop interval=0 timeout=320 \ op start interval=0 timeout=240 primitive Clvmd lsb:clvmd ms AdminClone AdminDrbd \ meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true clone ClvmdClone Clvmd colocation ClvmdWithAdmin inf: ClvmdClone AdminClone:Master order AdminBeforeClvmd inf: AdminClone:promote ClvmdClone:start property $id=cib-bootstrap-options \ dc-version=1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \ cluster-infrastructure=cman \ stonith-enabled=false DRBD looks OK: # cat /proc/drbd version: 8.4.0 (api:1/proto:86-100) GIT-hash: 28753f559ab51b549d16bcf487fe625d5919c49c build by gardner@, 2012-01-25 19:10:28 0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r- ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 I can manually do drbdadm primary admin on both nodes and get a Primary/Primary state. That still does not get Pacemaker to promote the resource. The only vaguely relevant lines in /var/log/messages seem to be: Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout) Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output: (AdminDrbd:0:start:stderr) Could not map uname= hypatia-tb.nevis.columbia.edu to a UUID: The object/attribute does not exist Jan 30 17:38:13 hypatia-tb lrmd: [11260]: info: RA output (AdminDrbd:0:start:stdout) I've tried running with iptables both on and off, and the results are the same. Any clues? -- Bill Seligman | Phone: (914) 591-2823 Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu PO Box 137| Irvington NY 10533 USA| http://www.nevis.columbia.edu/~seligman/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- esta es mi vida e me la vivo hasta que dios quiera ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
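One DRBD-side detail for the dual-primary case, independent of the promotion question: the resource itself has to allow two primaries, and once fencing works the crm-fence-peer handlers are usually wired in as well. A sketch of the relevant drbd.conf pieces for the resource named in this thread (everything here is illustrative rather than a copy of the poster's configuration); the "Could not map uname ... to a UUID" message may also point at a short-name versus FQDN mismatch between the node names in cluster.conf and what the RA hands to the CIB, which seems worth checking separately:

resource admin {
    net {
        protocol C;
        allow-two-primaries yes;       # required before both nodes may be Primary
    }
    disk {
        fencing resource-and-stonith;  # freeze I/O and call the handler when the peer is lost
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}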