On Apr 1, 2011, at 3:22 AM, Tim Serong wrote:

> On 4/1/2011 at 11:37 AM, Vadym Chepkov <[email protected]> wrote:
>> On Mar 31, 2011, at 2:30 PM, Christoph Bartoschek wrote:
>>
>>> On 29.03.2011 15:31, Dejan Muhamedagic wrote:
>>>> On Tue, Mar 29, 2011 at 08:13:49AM +0200, Christoph Bartoschek wrote:
>>>>> On 29.03.2011 02:35, Vadym Chepkov wrote:
>>>>>>
>>>>>> On Mar 28, 2011, at 10:55 AM, Christoph Bartoschek wrote:
>>>>>>
>>>>>>> On 28.03.2011 16:30, Dejan Muhamedagic wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Mon, Mar 21, 2011 at 11:33:49PM +0100, Christoph Bartoschek wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am testing an NFS failover setup. During the tests I created a
>>>>>>>>> split-brain situation, and now node A thinks it is Primary and
>>>>>>>>> UpToDate while node B thinks that it is Outdated.
>>>>>>>>>
>>>>>>>>> crm_mon, however, does not indicate any error to me. Why is this
>>>>>>>>> the case? I expect to see something that shows me the degraded
>>>>>>>>> status. How can this be fixed?
>>>>>>>>
>>>>>>>> The cluster relies on the RA (in this case drbd) to report any
>>>>>>>> problems. Do you have a monitor operation defined for that
>>>>>>>> resource?
>>>>>>>
>>>>>>> I have the resource defined as:
>>>>>>>
>>>>>>> primitive p_drbd ocf:linbit:drbd \
>>>>>>>   params drbd_resource="home-data" \
>>>>>>>   op monitor interval="15" role="Master" \
>>>>>>>   op monitor interval="30" role="Slave"
>>>>>>>
>>>>>>> Is this a correct monitor operation?
>>>>
>>>> Yes, though you should also add timeout specs.
>>>>
>>>>>> Just out of curiosity, do you have the ms resource defined as well?
>>>>>>
>>>>>> ms ms_p_drbd p_drbd \
>>>>>>   meta master-max="1" master-node-max="1" clone-max="2" \
>>>>>>   clone-node-max="1" notify="true"
>>>>>>
>>>>>> Because if you do, and the cluster is not aware of the split-brain,
>>>>>> the drbd RA has a serious flaw.
>>>>>>
>>>>> I'm sorry. Yes, the ms resource is also defined.
>>>>
>>>> Well, I'm really confused.
>>>> You basically say that the drbd disk gets into a degraded mode
>>>> (i.e. it detects split brain), but the cluster (pacemaker) never
>>>> learns about that. Perhaps you should open a bugzilla for this and
>>>> supply an hb_report. Though it's really hard to believe; it's like
>>>> basic functionality failing.
>>>
>>> What would you expect to see?
>>>
>>> Currently I see the following in crm_mon:
>>>
>>>  Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
>>>      Masters: [ ries ]
>>>      Slaves: [ laplace ]
>>>
>>> At the same time "cat /proc/drbd" on ries says:
>>>
>>> ries:~ # cat /proc/drbd
>>> version: 8.3.9 (api:88/proto:86-95)
>>> srcversion: A67EB2D25C5AFBFF3D8B788
>>>  0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
>>>     ns:0 nr:0 dw:4 dr:1761 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
>>>
>>> And on node laplace it says:
>>>
>>> laplace:~ # cat /proc/drbd
>>> version: 8.3.9 (api:88/proto:86-95)
>>> srcversion: A67EB2D25C5AFBFF3D8B788
>>>  0: cs:StandAlone ro:Secondary/Unknown ds:Outdated/DUnknown r-----
>>>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
>>
>> Yes, and according to the RA script everything is perfect:
>>
>> drbd_status() {
>>     local rc
>>     rc=$OCF_NOT_RUNNING
>>
>>     if ! is_drbd_enabled || ! [ -b "$DRBD_DEVICE" ]; then
>>         return $rc
>>     fi
>>
>>     # ok, module is loaded, block device node exists.
>>     # lets see its status
>>     drbd_set_status_variables
>>     case "${DRBD_ROLE_LOCAL}" in
>>     Primary)
>>         rc=$OCF_RUNNING_MASTER
>>         ;;
>>     Secondary)
>>         rc=$OCF_SUCCESS
>>         ;;
>>     Unconfigured)
>>         rc=$OCF_NOT_RUNNING
>>         ;;
>>     *)
>>         ocf_log err "Unexpected role ${DRBD_ROLE_LOCAL}"
>>         rc=$OCF_ERR_GENERIC
>>         ;;
>>     esac
>>
>>     return $rc
>> }
>>
>> Staggering.
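As a side note, the DRBD_* variables the RA switches on come from drbd_set_status_variables, which parses status output like the /proc/drbd lines quoted above. A minimal stand-alone sketch of that kind of parsing (parse_drbd_status is a hypothetical helper for illustration, not the RA's actual code):

```shell
# Hypothetical helper: extract the connection state, local role and
# local disk state from a /proc/drbd status line such as
#   0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
parse_drbd_status() {
    local line="$1" cs ro ds
    cs=$(printf '%s\n' "$line" | sed -n 's/.*cs:\([A-Za-z]*\).*/\1/p')
    ro=$(printf '%s\n' "$line" | sed -n 's/.*ro:\([A-Za-z]*\)\/.*/\1/p')
    ds=$(printf '%s\n' "$line" | sed -n 's/.*ds:\([A-Za-z]*\)\/.*/\1/p')
    echo "$cs $ro $ds"
}
```

Running it against the ries line above prints "StandAlone Primary UpToDate", so the connection state is sitting right there for the RA to act on.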
>>
>> The drbd_set_status_variables subroutine does set DRBD_CSTATE.
>>
>> I think the RA needs to be modified to something like this:
>>
>> Secondary)
>>     if [[ $DRBD_CSTATE == Connected ]]; then
>>         rc=$OCF_SUCCESS
>>     else
>>         rc=$OCF_NOT_RUNNING
>>     fi
>
> That wouldn't strictly be correct - DRBD *is* currently running on
> both nodes, Primary (master) on one and Secondary (slave) on the
> other. This state is correctly reported in crm_mon. The thing
> that crm_mon can't tell you is that *third* piece of information,
> i.e. that there's some sort of communication breakdown between
> the two instances.
>
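To make the dispute concrete, here is the proposed mapping pulled out of the RA as a stand-alone, testable sketch (drbd_rc_for is a hypothetical function, not the actual patch, with OCF exit codes hard-coded: 0 = OCF_SUCCESS, 7 = OCF_NOT_RUNNING, 8 = OCF_RUNNING_MASTER, 1 = OCF_ERR_GENERIC):

```shell
# Hypothetical stand-alone sketch of what monitor would report under
# the proposed change; for illustration only, not the actual RA patch.
drbd_rc_for() {
    local role="$1" cstate="$2"
    case "$role" in
        Primary)
            echo 8 ;;          # OCF_RUNNING_MASTER
        Secondary)
            # The proposal: a Secondary that is not Connected is
            # reported as stopped, so the cluster notices the failure.
            if [ "$cstate" = Connected ]; then
                echo 0         # OCF_SUCCESS
            else
                echo 7         # OCF_NOT_RUNNING
            fi ;;
        Unconfigured)
            echo 7 ;;          # OCF_NOT_RUNNING, as in the original RA
        *)
            echo 1 ;;          # OCF_ERR_GENERIC
    esac
}
```

With the original RA, a StandAlone Secondary still reports OCF_SUCCESS; under the change, `drbd_rc_for Secondary StandAlone` yields 7, which is exactly what produces a stopped Slave and a fail-count in crm_mon.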
Well, it is definitely not doing its "Slave" job when it is not connected.

> That being said, I'll defer to the DRBD crew as to whether or not
> returning $OCF_NOT_RUNNING in this case is technically safe and/or
> desirable.
>
> (I know it's administratively highly desirable to see these failures,
> of course; I'm just not clear on how best to expose them.)

Well, the current situation is unacceptable, at least for me. I shut
everything down, disconnected the direct link, and started the cluster
back up, and there is no indication whatsoever in the cluster status
that drbd is in trouble, except for the location constraint added by
crm-fence-peer.sh. Even the score attributes for the master resource
are not negative on the disconnected secondary.

After I applied my fix, all is kosher - I get the Slave reported as
stopped, and I get a fail-count.

Vadym
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
