Is there a way to make this work properly without STONITH? I forgot to mention
that both nodes are virtual machines (QEMU/KVM), which makes STONITH a minor
challenge. Also, since these symptoms occur even under "pcs cluster standby",
where STONITH *shouldn't* be invoked, I'm not sure that's the entire answer.
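
For what it's worth, if STONITH turns out to be unavoidable, my understanding
is that the usual route for QEMU/KVM guests is fence_xvm (with fence_virtd
running on the hypervisor) or fence_virsh. Roughly what I'd try, assuming the
libvirt domain names match the cluster node names:

# fence_xvm -o list    (on each guest, to confirm it can reach fence_virtd)
# pcs stonith create fence_nfsnode01 fence_xvm port="nfsnode01" pcmk_host_list="nfsnode01"
# pcs stonith create fence_nfsnode02 fence_xvm port="nfsnode02" pcmk_host_list="nfsnode02"
# pcs property set stonith-enabled=true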


On 03/16/2016 13:34 -0400, Digimer wrote:
>>      On 16/03/16 01:17 PM, Tim Walberg wrote:
>>      > Having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
>>      > (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
>>      > resources consist of a cluster address, a DRBD device mirroring between
>>      > the two cluster nodes, the file system, and the nfs-server resource.
>>      > The resources all behave properly until an extended failover or outage.
>>      > 
>>      > I have tested failover in several ways ("pcs cluster standby", "pcs
>>      > cluster stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger",
>>      > etc.), and the symptom is that, until the killed node is brought back
>>      > into the cluster, failover never completes. On the surviving node the
>>      > DRBD device appears to be in a "Secondary/Unknown" state, and the
>>      > resources end up looking like this:
>>      > 
>>      > # pcs status
>>      > Cluster name: nfscluster
>>      > Last updated: Wed Mar 16 12:05:33 2016          Last change: Wed Mar 16
>>      > 12:04:46 2016 by root via cibadmin on nfsnode01
>>      > Stack: corosync
>>      > Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition
>>      > with quorum
>>      > 2 nodes and 5 resources configured
>>      > 
>>      > Online: [ nfsnode01 ]
>>      > OFFLINE: [ nfsnode02 ]
>>      > 
>>      > Full list of resources:
>>      > 
>>      >  nfsVIP      (ocf::heartbeat:IPaddr2):       Started nfsnode01
>>      >  nfs-server     (systemd:nfs-server):   Stopped
>>      >  Master/Slave Set: drbd_master [drbd_dev]
>>      >      Slaves: [ nfsnode01 ]
>>      >      Stopped: [ nfsnode02 ]
>>      >  drbd_fs   (ocf::heartbeat:Filesystem):    Stopped
>>      > 
>>      > PCSD Status:
>>      >   nfsnode01: Online
>>      >   nfsnode02: Online
>>      > 
>>      > Daemon Status:
>>      >   corosync: active/enabled
>>      >   pacemaker: active/enabled
>>      >   pcsd: active/enabled
>>      > 
>>      > As soon as I bring the second node back online, the failover completes.
>>      > But this is obviously not a good state, as an extended outage on one
>>      > node for any reason essentially kills the cluster services. There's
>>      > clearly something I've missed in configuring the resources, but I
>>      > haven't been able to pinpoint it yet.
>>      > 
>>      > Perusing the logs, it appears that, upon the initial failure, DRBD does
>>      > in fact promote the drbd_master resource, but immediately afterwards
>>      > pengine calls for it to be demoted, for reasons I haven't pinned down
>>      > yet but which seem tied to the fencing configuration. I can see that
>>      > the crm-fence-peer.sh script is called, but it almost seems to be
>>      > fencing the wrong node... Indeed, I do see that it adds a -INFINITY
>>      > location constraint for the surviving node, which would explain the
>>      > decision to demote the DRBD master.
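
Side note, in case anyone wants to reproduce this: the constraint the handler
creates can be inspected and, once the dead peer is confirmed down, removed by
hand with something along these lines (assuming the usual drbd-fence-by-handler
naming; take the exact id from the constraint listing):

# pcs constraint --full | grep drbd-fence-by-handler
# pcs constraint remove drbd-fence-by-handler-drbd0-drbd_master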
>>      > 
>>      > My DRBD resource looks like this:
>>      > 
>>      > # cat /etc/drbd.d/drbd0.res
>>      > resource drbd0 {
>>      > 
>>      >         protocol C;
>>      >         startup { wfc-timeout 0; degr-wfc-timeout 120; }
>>      > 
>>      >         disk {
>>      >             on-io-error detach;
>>      >             fencing resource-only;
>>      
>>      This should be 'resource-and-stonith;', but that alone won't do anything
>>      until pacemaker's stonith is working.
>>      
>>      >         }
>>      > 
>>      >         handlers {
>>      >             fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>>      >             after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>>      >         }
>>      > 
>>      >         on nfsnode01 {
>>      >                 device /dev/drbd0;
>>      >                 disk /dev/vg_nfs/lv_drbd0;
>>      >                 meta-disk internal;
>>      >                 address 10.0.0.2:7788;
>>      >         }
>>      > 
>>      >         on nfsnode02 {
>>      >                 device /dev/drbd0;
>>      >                 disk /dev/vg_nfs/lv_drbd0;
>>      >                 meta-disk internal;
>>      >                 address 10.0.0.3:7788;
>>      >         }
>>      > }
>>      > 
>>      > If I comment out the three fencing-related lines, the failover works
>>      > properly. But I'd prefer to keep the fencing there on the off chance
>>      > that we end up with a split brain instead of just a node outage...
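
(Per the comment above about 'resource-and-stonith', I take it the disk section
should eventually read as follows once pacemaker-level STONITH is actually in
place; a sketch only:

        disk {
            on-io-error detach;
            fencing resource-and-stonith;
        }

with the fence-peer/after-resync-target handlers left as they are.)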
>>      > 
>>      > And, here's "pcs config --full":
>>      > 
>>      > # pcs config --full
>>      > Cluster Name: nfscluster
>>      > Corosync Nodes:
>>      >  nfsnode01 nfsnode02
>>      > Pacemaker Nodes:
>>      >  nfsnode01 nfsnode02
>>      > 
>>      > Resources:
>>      >  Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
>>      >   Attributes: ip=10.0.0.1 cidr_netmask=24
>>      >   Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
>>      >               stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
>>      >               monitor interval=15s (nfsVIP-monitor-interval-15s)
>>      >  Resource: nfs-server (class=systemd type=nfs-server)
>>      >   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>>      >  Master: drbd_master
>>      >   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>      >   Resource: drbd_dev (class=ocf provider=linbit type=drbd)
>>      >    Attributes: drbd_resource=drbd0
>>      >    Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
>>      >                promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
>>      >                demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
>>      >                stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
>>      >                monitor interval=29s role=Master (drbd_dev-monitor-interval-29s)
>>      >                monitor interval=31s role=Slave (drbd_dev-monitor-interval-31s)
>>      >  Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)
>>      >   Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs
>>      >   Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)
>>      >               stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)
>>      >               monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)
>>      > 
>>      > Stonith Devices:
>>      > Fencing Levels:
>>      > 
>>      > Location Constraints:
>>      > Ordering Constraints:
>>      >   start nfsVIP then start nfs-server (kind:Mandatory) (id:order-nfsVIP-nfs-server-mandatory)
>>      >   start drbd_fs then start nfs-server (kind:Mandatory) (id:order-drbd_fs-nfs-server-mandatory)
>>      >   promote drbd_master then start drbd_fs (kind:Mandatory) (id:order-drbd_master-drbd_fs-mandatory)
>>      > Colocation Constraints:
>>      >   nfs-server with nfsVIP (score:INFINITY) (id:colocation-nfs-server-nfsVIP-INFINITY)
>>      >   nfs-server with drbd_fs (score:INFINITY) (id:colocation-nfs-server-drbd_fs-INFINITY)
>>      >   drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master) (id:colocation-drbd_fs-drbd_master-INFINITY)
>>      > 
>>      > Resources Defaults:
>>      >  resource-stickiness: 100
>>      >  failure-timeout: 60
>>      > Operations Defaults:
>>      >  No defaults set
>>      > 
>>      > Cluster Properties:
>>      >  cluster-infrastructure: corosync
>>      >  cluster-name: nfscluster
>>      >  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>>      >  have-watchdog: false
>>      >  maintenance-mode: false
>>      >  stonith-enabled: false
>>      
>>      Configure *and test* stonith in pacemaker first; then DRBD will hook
>>      into it and use it properly. DRBD simply asks pacemaker to do the fence,
>>      but you currently don't have it set up.
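
And if we do end up going the STONITH route, I assume the way to test it,
before re-enabling the DRBD fence-peer handler, is to fence each node manually
and confirm it actually gets power-cycled, e.g.:

# pcs stonith fence nfsnode02
# stonith_admin --reboot nfsnode02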
>>      
>>      -- 
>>      Digimer
>>      Papers and Projects: https://alteeve.ca/w/
>>      What if the cure for cancer is trapped in the mind of a person without
>>      access to education?



-- 
[email protected]