Hi Andreas,

Correct, the performance of the box seems to screw things up re the time-outs.
Cheers,
Jimmy.

On 28 Aug 2013, at 10:35, "Andreas Mock" <[email protected]> wrote:

> Hi Jimmy,
>
> thank you for answering.
>
> Am I right to assume that it seems to have something to
> do with timing? (Is the performance of the server relevant?)
>
> Best regards
> Andreas Mock
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On behalf of Jimmy Magee
> Sent: Wednesday, 28 August 2013 10:42
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] Centos 6.4 KVM + DRBD 8.4.2 + Pacemaker 1.1.10 - Drbd
> Monitor Timeout? Drbd promotion/demotion failing!
>
> Hi Andreas,
>
> I could not get it working with the setup outlined below.
> I built the same setup on a higher-spec server and that seems to have resolved
> the issue.
>
> Cheers,
> Jimmy.
>
>
>
> On 28 Aug 2013, at 09:03, Andreas Mock <[email protected]> wrote:
>
>> Hi Jimmy,
>>
>> have you found a solution for your problem?
>> I found your thread while searching for a similar problem I have.
>>
>> Best regards
>> Andreas Mock
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On behalf of Jimmy Magee
>> Sent: Thursday, 4 July 2013 11:40
>> To: [email protected] mailing list
>> Subject: [Linux-HA] Centos 6.4 KVM + DRBD 8.4.2 + Pacemaker 1.1.10 - Drbd
>> Monitor Timeout? Drbd promotion/demotion failing!
>>
>> Hi All,
>>
>> I am currently setting up a DRBD/Pacemaker cluster on two CentOS 6.4 KVMs. The
>> KVMs run on a CentOS 6.4 host operating system, each VM installed on a
>> separate logical volume, with 1 GB RAM allocated to each VM. DRBD 8.4.2 +
>> Pacemaker 1.1.10 + Corosync 1.4.1-15 + Cman 3.0.12.1-49 cluster software is
>> installed and configured on both VMs.
>> DRBD starts manually, and promoting/demoting the device via drbdadm while
>> testing works perfectly.
>> Under HA control, however, DRBD causes the primary node to
>> repeatedly restart DRBD and its dependent resources, and it
>> fails to promote DRBD on the active node when the primary is in standby mode.
>> Observing DRBD via "service drbd status" or /proc/drbd when both nodes are
>> online, all seems healthy; however, the DRBD monitor is timing out, causing the
>> resources to be restarted on the primary node:
>>
>> Jul 3 21:47:19 webtext-2 lrmd[2511]: warning: child_timeout_callback:
>> mysql_drbd_monitor_0 process (PID 18391) timed out
>> Jul 3 21:47:21 webtext-2 crmd[2514]: error: process_lrm_event: LRM
>> operation mysql_drbd_monitor_0 (666) Timed Out (timeout=20000ms)
>> Jul 3 22:51:46 webtext-2 lrmd[19204]: warning: child_timeout_callback:
>> mysql_drbd_promote_0 process (PID 21046) timed out
>> Jul 3 22:51:47 webtext-2 crmd[19207]: error: process_lrm_event: LRM
>> operation mysql_drbd_promote_0 (189) Timed Out (timeout=20000ms)
>>
>>
>> I have adjusted a number of timeout settings to allow for the limited
>> hardware, but I am still not getting consistent failover. The current cluster
>> and drbd config settings are below, and the full corosync/pacemaker logs, with
>> more detailed info, are available here:
>> https://dl.dropboxusercontent.com/u/89694994/corosync-logs.zip
>> I would appreciate some guidance on resolving this issue.
>>
>> Many thanks,
>> Jimmy.
>>
>>
>>
>> Cluster Setup and Configs
>>
>>
>> # cat /proc/drbd
>> version: 8.4.2 (api:1/proto:86-101)
>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R6, 2012-09-06 08:16:10
>>  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>     ns:376 nr:0 dw:376 dr:6961 al:7 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>
>> The primary node is then put in standby mode:
>>
>> # cat /proc/drbd (webtext-2)
>> version: 8.4.2 (api:1/proto:86-101)
>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R6, 2012-09-06 08:16:10
>>  0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Outdated C r-----
>>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>
>> # cat /proc/drbd (webtext-1)
>> version: 8.4.2 (api:1/proto:86-101)
>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R6, 2012-09-06 08:16:10
>>
>>
>> # crm_mon --inactive --group-by-node -1
>> Last updated: Wed Jul 3 23:17:32 2013
>> Last change: Wed Jul 3 22:51:38 2013 via cibadmin on webtext-2
>> Stack: cman
>> Current DC: webtext-1 - partition with quorum
>> Version: 1.1.10-1.el6-2718638
>> 2 Nodes configured, unknown expected votes
>> 5 Resources configured.
>>
>>
>> Node webtext-1: standby
>> Node webtext-2: online
>>     mysql_drbd (ocf::linbit:drbd): Started
>>
>> Inactive resources:
>>
>> Master/Slave Set: mysql_ms [mysql_drbd]
>>     Slaves: [ webtext-2 ]
>>     Stopped: [ webtext-1 ]
>> Resource Group: mysql
>>     mysql_fs (ocf::heartbeat:Filesystem): Stopped
>>     mysql_init (lsb:mysql): Stopped
>>     jboss_init (lsb:jboss): Stopped
>>
>> Failed actions:
>>     mysql_drbd_monitor_130000 (node=webtext-1, call=349, rc=1, status=Timed Out,
>>         last-rc-change=Wed Jul 3 22:44:49 2013, queued=0ms, exec=0ms): unknown error
>>     mysql_drbd_promote_0 (node=webtext-2, call=189, rc=1, status=Timed Out,
>>         last-rc-change=Wed Jul 3 22:51:25 2013, queued=20980ms, exec=13ms): unknown error
>>
>>
>> # crm configure show
>> node webtext-1 \
>>     attributes standby="on"
>> node webtext-2 \
>>     attributes standby="off"
>> primitive jboss_init lsb:jboss \
>>     op monitor interval="40" timeout="120" start-delay="320" \
>>     op start interval="0" timeout="320" \
>>     op stop interval="0" timeout="320" \
>>     meta target-role="Started"
>> primitive mysql_drbd ocf:linbit:drbd \
>>     params drbd_resource="r0" \
>>     op monitor interval="130" role="Master" \
>>     op monitor interval="140" role="Slave" \
>>     op stop interval="0" timeout="240" \
>>     op start interval="0" timeout="320" \
>>     meta target-role="Started"
>> primitive mysql_fs ocf:heartbeat:Filesystem \
>>     params device="/dev/drbd/by-res/r0" directory="/drbd0/" fstype="ext4" \
>>     op stop interval="0" timeout="120" \
>>     op start interval="0" timeout="120" \
>>     op monitor interval="40" timeout="120" \
>>     meta is-managed="true"
>> primitive mysql_init lsb:mysql \
>>     op stop interval="0" timeout="320" \
>>     op start interval="0" timeout="320" \
>>     meta is-managed="true"
>> group mysql mysql_fs mysql_init \
>>     meta target-role="Started"
>> ms mysql_ms mysql_drbd \
>>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" is-managed="true"
>> location \
>> drbd-fence-by-handler-r0-mysql_ms mysql_ms \
>>     rule $id="drbd-fence-by-handler-r0-rule-mysql_ms" $role="Master" -inf: #uname ne webtext-2.vennetics.com
>> colocation jboss_with_mysql inf: jboss_init mysql
>> colocation mysql_on_drbd inf: mysql mysql_ms:Master
>> order jboss_after_mysql inf: mysql_init jboss_init
>> order mysql_after_drbd inf: mysql_ms:promote mysql:start
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.10-1.el6-2718638" \
>>     cluster-infrastructure="cman" \
>>     stonith-enabled="false" \
>>     last-lrm-refresh="1372884412" \
>>     no-quorum-policy="ignore"
>>
>>
>> # vi /etc/drbd.conf
>>
>> global {
>>     usage-count yes;
>> }
>>
>> resource r0 {
>>
>>     # write IO is reported as completed if it has reached both local
>>     # and remote disk
>>     protocol C;
>>
>>     net {
>>         # set up peer authentication
>>         cram-hmac-alg sha1;
>>         shared-secret "test";
>>     }
>>
>>     startup {
>>         # wait-for-connection timeout - boot process blocked
>>         # until DRBD resources are connected
>>         # ----- wfc-timeout 30;
>>         # WFC timeout if peer was outdated
>>         # ----- outdated-wfc-timeout 20;
>>         # WFC timeout if this node was in a degraded cluster (i.e. only had one
>>         # node left)
>>         # ----- degr-wfc-timeout 30;
>>     }
>>
>>     disk {
>>         fencing resource-only;
>>     }
>>
>>     handlers {
>>         fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>>     }
>>
>>     # first node
>>     on webtext-1.vennetics.com {
>>         # DRBD device
>>         device /dev/drbd0;
>>         # backing store device
>>         disk /dev/vg_webtext1_02/lv_drbd0;
>>         # IP address of node, and port to listen on
>>         address 10.87.79.218:7788;
>>         # use internal meta data (don't create a filesystem before
>>         # you create metadata!)
>>         meta-disk internal;
>>     }
>>     # second node
>>     on webtext-2.vennetics.com {
>>         # DRBD device
>>         device /dev/drbd0;
>>         # backing store device
>>         disk /dev/vg_webtext2_02/lv_drbd0;
>>         # IP address of node, and port to listen on
>>         address 10.87.79.219:7788;
>>         # use internal meta data (don't create a filesystem before
>>         # you create metadata!)
>>         meta-disk internal;
>>     }
>> }
>>
>>
>> # vi /etc/cluster/cluster.conf
>>
>> <?xml version="1.0"?>
>> <cluster config_version="1" name="webtext_cluster">
>>   <clusternodes>
>>     <clusternode name="webtext-1" nodeid="1">
>>       <fence>
>>         <method name="pcmk-redirect">
>>           <device name="pcmk" port="webtext-1"/>
>>         </method>
>>       </fence>
>>     </clusternode>
>>     <clusternode name="webtext-2" nodeid="2">
>>       <fence>
>>         <method name="pcmk-redirect">
>>           <device name="pcmk" port="webtext-2"/>
>>         </method>
>>       </fence>
>>     </clusternode>
>>   </clusternodes>
>>   <fencedevices>
>>     <fencedevice agent="fence_pcmk" name="pcmk"/>
>>   </fencedevices>
>>   <cman expected_votes="1" two_node="1"/>
>>   <logging to_syslog="yes" to_logfile="yes" syslog_facility="daemon" syslog_priority="info" logfile_priority="info">
>>     <logging_daemon name="qdiskd" logfile="/var/log/cluster/qdiskd.log" logfile_priority="debug"/>
>>     <logging_daemon name="fenced" logfile="/var/log/cluster/fenced.log" logfile_priority="debug"/>
>>     <logging_daemon name="dlm_controld" logfile="/var/log/cluster/dlm_controld.log" logfile_priority="debug"/>
>>     <logging_daemon name="gfs_controld" logfile="/var/log/cluster/gfs_controld.log" logfile_priority="debug"/>
>>     <logging_daemon name="corosync" logfile="/var/log/cluster/corosync.log" logfile_priority="debug"/>
>>   </logging>
>> </cluster>
>>
>>
>> # vi /etc/sysconfig/pacemaker
>>
>> # For non-systemd based systems, prefix export to each enabled line
>>
>> # Turn on special handling for CMAN clusters in the init script
>> # Without this, fenced (and by inference, cman) cannot reliably be made to
>> # shut down
>> PCMK_STACK=cman
>>
>> #==#==# Variables that control logging
>>
>> # Enable debug logging globally or per-subsystem
>> # Multiple subsystems may be listed separated by commas
>> PCMK_debug=crmd,pengine,cib,stonith-ng,attrd,pacemakerd
>>
>>
>> # rpm -qa | grep pacemaker
>> pacemaker-cli-1.1.10-1.el6.x86_64
>> pacemaker-libs-1.1.10-1.el6.x86_64
>> pacemaker-cluster-libs-1.1.10-1.el6.x86_64
>> pacemaker-libs-devel-1.1.10-1.el6.x86_64
>> pacemaker-remote-1.1.10-1.el6.x86_64
>> pacemaker-cts-1.1.10-1.el6.x86_64
>> pacemaker-debuginfo-1.1.10-1.el6.x86_64
>> pacemaker-1.1.10-1.el6.x86_64
>>
>> # rpm -qa | grep cman
>> cman-3.0.12.1-49.el6.x86_64
>>
>> # rpm -qa | grep coro
>> corosync-1.4.1-15.el6_4.1.x86_64
>> corosynclib-1.4.1-15.el6_4.1.x86_64
>> corosynclib-devel-1.4.1-15.el6_4.1.x86_64
>>
>> # rpm -qa | grep resource-agents
>> resource-agents-3.9.2-21.el6.x86_64
>>
>> # rpm -qa | grep libqb
>> libqb-devel-0.14.2-3.el6.x86_64
>> libqb-0.14.2-3.el6.x86_64
>>
>> # rpm -qa | grep drbd
>> drbd84-utils-8.4.2-1.el6.elrepo.x86_64
>> kmod-drbd84-8.4.2-1.el6_3.elrepo.x86_64
>>
>>
>> # ifconfig (webtext-2)
>> eth0      Link encap:Ethernet  HWaddr 52:54:00:65:EC:27
>>           inet addr:10.87.79.217  Bcast:10.87.79.255  Mask:255.255.255.0
>>           inet6 addr: fe80::5054:ff:fe65:ec27/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:116526 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:104213 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:19340444 (18.4 MiB)  TX bytes:53027494 (50.5 MiB)
>>           Interrupt:10 Base address:0xc000
>>
>> eth1      Link encap:Ethernet  HWaddr 52:54:00:95:68:C1
>>           inet addr:10.87.79.219  Bcast:10.87.79.255  Mask:255.255.255.0
>>           inet6 addr: fe80::5054:ff:fe95:68c1/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:436 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:14 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:25724 (25.1 KiB)  TX bytes:900 (900.0 b)
>>           Interrupt:10 Base address:0xe000
>>
>> # ifconfig (webtext-1)
>> eth0      Link encap:Ethernet  HWaddr 52:54:00:CB:9A:F4
>>           inet addr:10.87.79.216  Bcast:10.87.79.255  Mask:255.255.255.0
>>           inet6 addr: fe80::5054:ff:fecb:9af4/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:121593 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:107920 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:54007733 (51.5 MiB)  TX bytes:22464750 (21.4 MiB)
>>           Interrupt:10 Base address:0xc000
>>
>> eth1      Link encap:Ethernet  HWaddr 52:54:00:30:07:C9
>>           inet addr:10.87.79.218  Bcast:10.87.79.255  Mask:255.255.255.0
>>           inet6 addr: fe80::5054:ff:fe30:7c9/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:510 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:12 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:29934 (29.2 KiB)  TX bytes:720 (720.0 b)
>>           Interrupt:10 Base address:0xe000
>>
>>
>> # lvdisplay (webtext-2)
>>   --- Logical volume ---
>>   LV Path                /dev/vg_webtext2_02/lv_drbd0
>>   LV Name                lv_drbd0
>>   VG Name                vg_webtext2_02
>>   LV UUID                d8ATq9-XPqT-mTAZ-By3H-dEoL-SoDV-ebCJL3
>>   LV Write Access        read/write
>>   LV Creation host, time webtext-2.vennetics.com, 2013-06-30 10:35:10 +0100
>>   LV Status              available
>>   # open                 2
>>   LV Size                4.00 GiB
>>   Current LE             1023
>>   Segments               1
>>   Allocation             inherit
>>   Read ahead sectors     auto
>>   - currently set to     256
>>   Block device           253:2
>>
>> # lvdisplay (webtext-1)
>>   --- Logical volume ---
>>   LV Path                /dev/vg_webtext1_02/lv_drbd0
>>   LV Name                lv_drbd0
>>   VG Name                vg_webtext1_02
>>   LV UUID                3qB7lS-zH0O-WIKC-F6nl-0cuE-2Zu9-95RkF9
>>   LV Write Access        read/write
>>   LV Creation host, time webtext-1.vennetics.com, 2013-06-30 12:00:59 +0100
>>   LV Status              available
>>   # open                 0
>>   LV Size                4.00 GiB
>>   Current LE             1023
>>   Segments               1
>>   Allocation             inherit
>>   Read ahead sectors     auto
>>   - currently set to     256
>>   Block device           253:2
>>
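A note on the timeouts: the "Timed Out (timeout=20000ms)" log entries correspond to the monitor and promote operations of mysql_drbd, which carry no explicit timeout in the configuration above and therefore fall back to Pacemaker's 20-second default action timeout. One adjustment worth trying on slow hardware (a sketch only; the values here are illustrative assumptions, not a verified fix for this cluster) is to give those operations explicit timeouts in the crm shell:

    primitive mysql_drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="130" role="Master" timeout="120" \
        op monitor interval="140" role="Slave" timeout="120" \
        op promote interval="0" timeout="180" \
        op demote interval="0" timeout="120" \
        op stop interval="0" timeout="240" \
        op start interval="0" timeout="320" \
        meta target-role="Started"

The change can be applied with "crm configure edit mysql_drbd". Whether larger timeouts alone suffice is doubtful, given that (per the top of the thread) moving to a higher-spec server is what ultimately resolved the issue.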
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
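For reference, the manual promotion/demotion test mentioned at the top of the thread ("promoting/demoting the device via drbdadm while testing works perfectly") can be sketched as the following command transcript. It assumes resource "r0" from the drbd.conf above, a Secondary peer, and that Pacemaker is not managing DRBD at the time; it is illustrative, not verified on this cluster:

    # drbdadm up r0          (attach and connect the resource)
    # drbdadm primary r0     (promote this node; peer stays Secondary)
    # cat /proc/drbd         (expect ro:Primary/Secondary ds:UpToDate/UpToDate)
    # drbdadm secondary r0   (demote again before handing control back to Pacemaker)

If this sequence is fast and clean by hand but times out under Pacemaker, the bottleneck is the operation timeouts or the VM's I/O performance rather than the DRBD configuration itself.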
