Whoops, I guess I fixated on one aspect of the problem, specifically the "I/O errors on the secondary stop I/O on the primary" part. I was thinking that NFS problems were affecting one or both hosts. I don't think the primary will ever deliberately disconnect from the secondary, unless a split-brain occurs.
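If it helps narrow things down, one thing worth capturing during a hang is what the primary itself reports about the connection. Below is a rough sketch (not something from the DRBD tools) that pulls the connection state and the pending counters out of /proc/drbd text. It assumes the 8.3-style layout, where pe counts requests sent to the peer but not yet answered, ua counts requests received from the peer but not yet acknowledged, and ap counts application requests DRBD has not yet completed. The function name and the sample values are made up for illustration.

```python
import re

# Made-up sample in the style of /proc/drbd on DRBD 8.3 (values are illustrative).
SAMPLE = """\
version: 8.3.2 (api:88/proto:86-90)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:1024 nr:0 dw:1024 dr:0 al:3 bm:0 lo:0 pe:17 ua:0 ap:17 ep:1 wo:b oos:0
"""


def parse_proc_drbd(text):
    """Return {minor: {field: value}} for each device found in the text.

    Hypothetical helper: it only understands "key:value" tokens and does
    not try to interpret resync progress lines.
    """
    devices = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if match:
            # A line such as " 0: cs:Connected ..." starts a new device.
            current = devices.setdefault(int(match.group(1)), {})
            rest = match.group(2)
        elif current is not None and line.startswith(" "):
            # Indented continuation line carrying the counters.
            rest = line
        else:
            # Header lines (version and so on) before the first device.
            continue
        for key, value in re.findall(r"(\w+):(\S+)", rest):
            current[key] = value
    return devices


if __name__ == "__main__":
    # On a real host: parse_proc_drbd(open("/proc/drbd").read())
    for minor, dev in sorted(parse_proc_drbd(SAMPLE).items()):
        print(minor, dev.get("cs"), dev.get("ro"), dev.get("ds"),
              "pe=" + dev.get("pe", "?"),
              "ua=" + dev.get("ua", "?"),
              "ap=" + dev.get("ap", "?"))
```

A single snapshot with a non-zero pe is normal under load, so the interesting case would be two snapshots taken a few seconds apart where cs stays Connected while pe and ap never drain.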
Can you go into more detail about what you mean by "stops I/O on the primary"? Do I/O requests to the DRBD volume start failing? Does the primary freeze/act odd in any other way?

James Masson wrote:
> Hi Jeff,
>
> thanks for the response.
>
> I'm using "rsize=32768,wsize=32768,nointr,timeo=300,noatime"
>
> But I don't see what that has to do with a DRBD Primary not detecting that
> it's Secondary is broken, and disconnecting it. Or the Secondary itself not
> realising it's broken, when it should have disconnected by itself.
>
> I can reproduce the issue without NFS, just using local filesystem
> interaction on the Primary.
>
> Am I missing something about how write timeouts work on DRBD?
>
> James
>
> jeff wrote:
>
>> Are you mounting the NFS volumes with a timeout? I seem to recall that
>> an NFS timeout can really screw with a system, whether it's primary or
>> secondary. I usually mount my NFS with the soft,timeo=30 options.
>>
>> Hope that helps.
>>
>>> Message: 2
>>> Date: Wed, 02 Dec 2009 09:21:45 +0000
>>> From: James Masson <[email protected]>
>>> Subject: Re: [DRBD-user] Primary not disconnecting Secondary with IO
>>> problems
>>> To: [email protected]
>>> Message-ID: <[email protected]>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>>
>>> has anybody seen this before, got any insight?
>>>
>>> James
>>>
>>> James Masson wrote:
>>>
>>>
>>>> Hi list,
>>>>
>>>> I'm using DRBD and NFS to provide HA to Virtual Machine images between
>>>> pairs of storage servers.
>>>>
>>>> Systems are RHEL5.4 2.6.18-164.el5 + drbd8.3 from Centos Extras
>>>>
>>>> We've been having issues where disk I/O problems on the DRBD Secondary
>>>> stops all IO to the Primary too. DRBD doesn't seem to recognise these
>>>> disk I/O problems, the Secondary isn't disconnected automatically.
>>>> Everything just hangs.
>>>>
>>>> During this state:
>>>> If I try a "drbdadm disconnect all" on the Primary, the command hangs.
>>>> If I try this on the Secondary, the command eventually completes, and NFS
>>>> I/O returns to normal operation on the Primary.
>>>>
>>>> I've tried the following things to fix this:
>>>>
>>>> 1) Putting in a custom local-io-error handler to hard reset the problem
>>>> node.
>>>>
>>>> This never triggers. Just like the default "detach", never triggers.
>>>>
>>>> 2) Changing the net connection parameters to:
>>>>
>>>> net {
>>>>     ko-count 2;
>>>>     timeout 20;
>>>> }
>>>>
>>>> Again, this never triggers.
>>>>
>>>>
>>>> 3) Changing the protocol used from C to B
>>>>
>>>> Doesn't have any effect on the issue - I'd prefer to use C anyway.
>>>>
>>>>
>>>> Any further ideas on how to track this issue down and fix it?
>>>>
>>>> thanks
>>>>
>>>> James Masson
>>>> _______________________________________________
>>>> drbd-user mailing list
>>>> [email protected]
>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
