> On 12 Nov 2014, at 5:16 pm, Ho, Alamsyah - ACE Life Indonesia
> <alamsyah...@acegroup.com> wrote:
>
> Hi All,
>
> In the October archives I saw the issue reported by Felix Zachlod at
> http://oss.clusterlabs.org/pipermail/pacemaker/2014-October/022653.html
> and the same thing is now happening to me on a dual-primary DRBD pair.
>
> My current OS is RHEL 6.6 and the software versions I use are:
> pacemaker-1.1.12-4.el6.x86_64
> corosync-1.4.7-1.el6.x86_64
> cman-3.0.12.1-68.el6.x86_64
Since you're using cman, let it start the dlm and gfs control daemons
(dlm_controld and gfs_controld). Do not create them as resources in
pacemaker.

> drbd84-utils-8.9.1-1.el6.elrepo.x86_64
> kmod-drbd84-8.4.5-2.el6.elrepo.x86_64
> gfs2-utils-3.0.12.1-68.el6.x86_64
>
> First, let me explain my existing resources. I have 3 resources: drbd,
> dlm for gfs2, and HomeFS.
>
> Master: HomeDataClone
>   Meta Attrs: master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interval=0s
>   Resource: HomeData (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=homedata
>    Operations: start interval=0s timeout=240 (HomeData-start-timeout-240)
>                promote interval=0s (HomeData-promote-interval-0s)
>                demote interval=0s timeout=90 (HomeData-demote-timeout-90)
>                stop interval=0s timeout=100 (HomeData-stop-timeout-100)
>                monitor interval=60s (HomeData-monitor-interval-60s)
> Clone: HomeFS-clone
>   Meta Attrs: start-delay=30s target-role=Stopped
>   Resource: HomeFS (class=ocf provider=heartbeat type=Filesystem)
>    Attributes: device=/dev/drbd/by-res/homedata directory=/home fstype=gfs2
>    Operations: start interval=0s timeout=60 (HomeFS-start-timeout-60)
>                stop interval=0s timeout=60 (HomeFS-stop-timeout-60)
>                monitor interval=20 timeout=40 (HomeFS-monitor-interval-20)
> Clone: dlm-clone
>   Meta Attrs: clone-max=2 clone-node-max=1 start-delay=0s
>   Resource: dlm (class=ocf provider=pacemaker type=controld)
>    Operations: start interval=0s timeout=90 (dlm-start-timeout-90)
>                stop interval=0s timeout=100 (dlm-stop-timeout-100)
>                monitor interval=60s (dlm-monitor-interval-60s)
>
> But when I start the cluster under normal conditions, it causes a split
> brain on DRBD on each node. From the log I can see it is the same case
> as Felix's: pacemaker promotes drbd to primary while it is still
> waiting for the handshake connection between the nodes.
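To make the advice above concrete: on EL6 the cman init script is what
starts dlm_controld (and, with the gfs2 packages, gfs_controld), so the
ocf:pacemaker:controld clone above duplicates it. A minimal sketch of
the cleanup, assuming the resource IDs shown above; verify against your
own cluster before running anything:

  # Drop the pacemaker-managed controld clone; cman already starts
  # dlm_controld when the node joins the cluster.
  pcs resource delete dlm-clone

  # Confirm that cman's own control daemons are the ones running.
  service cman status
  ps -C dlm_controld -o pid,args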
> Nov 12 11:37:32 node002 kernel: block drbd1: disk( Attaching -> UpToDate )
> Nov 12 11:37:32 node002 kernel: block drbd1: attached to UUIDs C9630089EC3B58CC:0000000000000000:B4653C665EBC0DBB:B4643C665EBC0DBA
> Nov 12 11:37:32 node002 kernel: drbd homedata: conn( StandAlone -> Unconnected )
> Nov 12 11:37:32 node002 kernel: drbd homedata: Starting receiver thread (from drbd_w_homedata [22531])
> Nov 12 11:37:32 node002 kernel: drbd homedata: receiver (re)started
> Nov 12 11:37:32 node002 kernel: drbd homedata: conn( Unconnected -> WFConnection )
> Nov 12 11:37:32 node002 attrd[22340]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (1000)
> Nov 12 11:37:32 node002 attrd[22340]: notice: attrd_perform_update: Sent update 17: master-HomeData=1000
> Nov 12 11:37:32 node002 crmd[22342]: notice: process_lrm_event: Operation HomeData_start_0: ok (node=node002, call=18, rc=0, cib-update=13, confirmed=true)
> Nov 12 11:37:33 node002 crmd[22342]: notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=19, rc=0, cib-update=0, confirmed=true)
> Nov 12 11:37:33 node002 crmd[22342]: notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=20, rc=0, cib-update=0, confirmed=true)
> Nov 12 11:37:33 node002 kernel: block drbd1: role( Secondary -> Primary )
> Nov 12 11:37:33 node002 kernel: block drbd1: new current UUID 58F02AE0E03C1C91:C9630089EC3B58CC:B4653C665EBC0DBB:B4643C665EBC0DBA
> Nov 12 11:37:33 node002 crmd[22342]: notice: process_lrm_event: Operation HomeData_promote_0: ok (node=node002, call=21, rc=0, cib-update=14, confirmed=true)
> Nov 12 11:37:33 node002 attrd[22340]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (10000)
> Nov 12 11:37:33 node002 attrd[22340]: notice: attrd_perform_update: Sent update 23: master-HomeData=10000
> Nov 12 11:37:33 node002 crmd[22342]: notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=22, rc=0, cib-update=0, confirmed=true)
> Nov 12 11:37:33 node002 kernel: drbd homedata: Handshake successful: Agreed network protocol version 101
> Nov 12 11:37:33 node002 kernel: drbd homedata: Agreed to support TRIM on protocol level
> Nov 12 11:37:33 node002 kernel: drbd homedata: Peer authenticated using 20 bytes HMAC
> Nov 12 11:37:33 node002 kernel: drbd homedata: conn( WFConnection -> WFReportParams )
> Nov 12 11:37:33 node002 kernel: drbd homedata: Starting asender thread (from drbd_r_homedata [22543])
> Nov 12 11:37:33 node002 kernel: block drbd1: drbd_sync_handshake:
> Nov 12 11:37:33 node002 kernel: block drbd1: self 58F02AE0E03C1C91:C9630089EC3B58CC:B4653C665EBC0DBB:B4643C665EBC0DBA bits:0 flags:0
> Nov 12 11:37:33 node002 kernel: block drbd1: peer 0FAA8E4B66817421:C9630089EC3B58CD:B4653C665EBC0DBA:B4643C665EBC0DBA bits:0 flags:0
> Nov 12 11:37:33 node002 kernel: block drbd1: uuid_compare()=100 by rule 90
> Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1
> Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm initial-split-brain minor-1 exit code 0 (0x0)
> Nov 12 11:37:33 node002 kernel: block drbd1: Split-Brain detected but unresolved, dropping connection!
> Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1
> Nov 12 11:37:33 node002 kernel: block drbd1: helper command: /sbin/drbdadm split-brain minor-1 exit code 0 (0x0)
> Nov 12 11:37:33 node002 kernel: drbd homedata: conn( WFReportParams -> Disconnecting )
> Nov 12 11:37:33 node002 kernel: drbd homedata: error receiving ReportState, e: -5 l: 0!
> Nov 12 11:37:33 node002 kernel: drbd homedata: asender terminated
> Nov 12 11:37:33 node002 kernel: drbd homedata: Terminating drbd_a_homedata
> Nov 12 11:37:33 node002 kernel: drbd homedata: Connection closed
> Nov 12 11:37:33 node002 kernel: drbd homedata: conn( Disconnecting -> StandAlone )
> Nov 12 11:37:33 node002 kernel: drbd homedata: receiver terminated
> Nov 12 11:37:33 node002 kernel: drbd homedata: Terminating drbd_r_homedata
>
> But if I disable the other two resources and enable only HomeDataClone
> at cluster startup, then the drbd device starts connected and both
> nodes are promoted to primary. Here is the log:
>
> Nov 12 12:38:11 node002 kernel: drbd homedata: Starting worker thread (from drbdsetup-84 [26752])
> Nov 12 12:38:11 node002 kernel: block drbd1: disk( Diskless -> Attaching )
> Nov 12 12:38:11 node002 kernel: drbd homedata: Method to ensure write ordering: flush
> Nov 12 12:38:11 node002 kernel: block drbd1: max BIO size = 1048576
> Nov 12 12:38:11 node002 kernel: block drbd1: drbd_bm_resize called with capacity == 314563128
> Nov 12 12:38:11 node002 kernel: block drbd1: resync bitmap: bits=39320391 words=614382 pages=1200
> Nov 12 12:38:11 node002 kernel: block drbd1: size = 150 GB (157281564 KB)
> Nov 12 12:38:11 node002 kernel: block drbd1: recounting of set bits took additional 7 jiffies
> Nov 12 12:38:11 node002 kernel: block drbd1: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
> Nov 12 12:38:11 node002 kernel: block drbd1: disk( Attaching -> UpToDate )
> Nov 12 12:38:11 node002 kernel: block drbd1: attached to UUIDs 01FA7FA3D219A8B4:0000000000000000:0FAB8E4B66817420:0FAA8E4B66817421
> Nov 12 12:38:11 node002 kernel: drbd homedata: conn( StandAlone -> Unconnected )
> Nov 12 12:38:11 node002 kernel: drbd homedata: Starting receiver thread (from drbd_w_homedata [26753])
> Nov 12 12:38:11 node002 kernel: drbd homedata: receiver (re)started
> Nov 12 12:38:11 node002 kernel: drbd homedata: conn( Unconnected -> WFConnection )
> Nov 12 12:38:11 node002 attrd[26577]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (1000)
> Nov 12 12:38:11 node002 attrd[26577]: notice: attrd_perform_update: Sent update 17: master-HomeData=1000
> Nov 12 12:38:11 node002 crmd[26579]: notice: process_lrm_event: Operation HomeData_start_0: ok (node=node002, call=18, rc=0, cib-update=12, confirmed=true)
> Nov 12 12:38:11 node002 crmd[26579]: notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=19, rc=0, cib-update=0, confirmed=true)
> Nov 12 12:38:11 node002 kernel: drbd homedata: Handshake successful: Agreed network protocol version 101
> Nov 12 12:38:11 node002 kernel: drbd homedata: Agreed to support TRIM on protocol level
> Nov 12 12:38:11 node002 kernel: drbd homedata: Peer authenticated using 20 bytes HMAC
> Nov 12 12:38:11 node002 kernel: drbd homedata: conn( WFConnection -> WFReportParams )
> Nov 12 12:38:11 node002 kernel: drbd homedata: Starting asender thread (from drbd_r_homedata [26764])
> Nov 12 12:38:11 node002 kernel: block drbd1: drbd_sync_handshake:
> Nov 12 12:38:11 node002 kernel: block drbd1: self 01FA7FA3D219A8B4:0000000000000000:0FAB8E4B66817420:0FAA8E4B66817421 bits:0 flags:0
> Nov 12 12:38:11 node002 kernel: block drbd1: peer 4499ABF2AAE91DF2:01FA7FA3D219A8B5:0FAB8E4B66817421:0FAA8E4B66817421 bits:0 flags:0
> Nov 12 12:38:11 node002 kernel: block drbd1: uuid_compare()=-1 by rule 50
> Nov 12 12:38:11 node002 kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
> Nov 12 12:38:11 node002 kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Nov 12 12:38:11 node002 kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Nov 12 12:38:11 node002 kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID )
> Nov 12 12:38:11 node002 kernel: block drbd1: updated sync uuid 01FB7FA3D219A8B4:0000000000000000:0FAB8E4B66817420:0FAA8E4B66817421
> Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
> Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
> Nov 12 12:38:11 node002 kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
> Nov 12 12:38:11 node002 kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
> Nov 12 12:38:11 node002 kernel: block drbd1: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> Nov 12 12:38:11 node002 kernel: block drbd1: updated UUIDs 4499ABF2AAE91DF2:0000000000000000:01FB7FA3D219A8B4:01FA7FA3D219A8B5
> Nov 12 12:38:11 node002 kernel: block drbd1: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
> Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1
> Nov 12 12:38:11 node002 crm-unfence-peer.sh[26804]: invoked for homedata
> Nov 12 12:38:11 node002 kernel: block drbd1: helper command: /sbin/drbdadm after-resync-target minor-1 exit code 0 (0x0)
> Nov 12 12:38:12 node002 crmd[26579]: notice: process_lrm_event: Operation dlm_stop_0: ok (node=node002, call=17, rc=0, cib-update=13, confirmed=true)
> Nov 12 12:38:13 node002 crmd[26579]: notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=20, rc=0, cib-update=0, confirmed=true)
> Nov 12 12:38:13 node002 kernel: block drbd1: peer( Secondary -> Primary )
> Nov 12 12:38:13 node002 kernel: block drbd1: role( Secondary -> Primary )
> Nov 12 12:38:13 node002 crmd[26579]: notice: process_lrm_event: Operation HomeData_promote_0: ok (node=node002, call=21, rc=0, cib-update=14, confirmed=true)
> Nov 12 12:38:13 node002 attrd[26577]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-HomeData (10000)
> Nov 12 12:38:13 node002 attrd[26577]: notice: attrd_perform_update: Sent update 23: master-HomeData=10000
> Nov 12 12:38:13 node002 crmd[26579]: notice: process_lrm_event: Operation HomeData_notify_0: ok (node=node002, call=22, rc=0, cib-update=0, confirmed=true)
>
> Based on the result above, I then tried to add an ordering constraint
> to start HomeDataClone first and then dlm-clone, while still keeping
> HomeFS-clone disabled. The result is that drbd starts up connected on
> both nodes and both become primary, but dlm-clone cannot be started
> after HomeDataClone, which leaves the dlm-clone service stopped and
> not running:
>
> Master/Slave Set: HomeDataClone [HomeData]
>      Masters: [ node001 node002 ]
> Clone Set: HomeFS-clone [HomeFS]
>      Stopped: [ node001 node002 ]
> Clone Set: dlm-clone [dlm]
>      Stopped: [ node001 node002 ]
>
> Failed actions:
>     dlm_start_0 on node001 'not configured' (6): call=21, status=complete, last-rc-change='Wed Nov 12 12:46:42 2014', queued=0ms, exec=59ms
>     dlm_start_0 on node002 'not configured' (6): call=21, status=complete, last-rc-change='Wed Nov 12 12:46:40 2014', queued=0ms, exec=69ms
>
> Please help me answer a few questions:
> 1. Why does the pacemaker controld resource fail to start if I set a
> constraint order to start the drbd resource first and then start
> controld?
> 2. Is there a configuration to delay the drbd promote? For example,
> after starting the drbd service, wait before promoting the resource
> to primary.
>
> Right now I am stuck getting the cluster running and managing all the
> resources without any errors. My temporary workaround is to start the
> drbd service manually before using pcs to start up the cluster. Of
> course this is not best practice, so I would appreciate any advice or
> feedback to fix this issue.
>
> Thanks in advance for any hint/advice.
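On the two questions:

1. With cman in the picture, dlm_controld already runs outside
pacemaker, so rather than debugging the 'not configured' (rc=6) start
failure of the controld resource, drop dlm-clone as above and order the
filesystem against the DRBD master directly. A sketch using your
resource IDs (untested against your cluster):

  # Promote DRBD before starting GFS2, and keep the filesystem only on
  # nodes where DRBD is master.
  pcs constraint order promote HomeDataClone then start HomeFS-clone
  pcs constraint colocation add HomeFS-clone with master HomeDataClone

2. I know of no explicit "promote delay" option, but the linbit drbd
agent has an adjust_master_score parameter (default "5 10 1000 10000"),
and setting its first value to 0 should keep pacemaker from promoting a
node whose data is merely Consistent because the peer handshake has not
completed yet, which is exactly the race in your first log. Treat this
as an assumption to verify against your drbd84-utils version:

  # First value 0: no master score while the local data is only
  # "consistent" (peer not yet seen), so promotion waits for the
  # handshake instead of racing it.
  pcs resource update HomeData adjust_master_score="0 10 1000 10000"

Also note that the split brain you already triggered will not heal by
itself. The usual manual recovery on DRBD 8.4 is to pick the node whose
changes you are willing to discard:

  # On the node to be overwritten:
  drbdadm disconnect homedata
  drbdadm secondary homedata
  drbdadm connect --discard-my-data homedata

  # On the surviving node, if it dropped to StandAlone:
  drbdadm connect homedata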
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org