On 6/26/06, Lars Ellenberg <[email protected]> wrote:
> / 2006-06-24 12:33:38 +0200
> \ Andreas Schader:
> > I found out, that when I reboot both nodes with "shutdown -r now" at
> > the same time the syncing starts after both are up again and soon
> > after that secondary goes back to "Consistent" in /proc/drbd.
>
> is this a dedicated replication link?

yes, each machine has two NICs: one for the NFS LAN and one connected
directly to the other node with a 1 Gbit crossover cable.

> as a workaround for whatever the real problem is with your setup,
> try to minimize nfs-activity, and do <...>

I minimised io load and nfs traffic by turning off the
nfs-kernel-server and doing only some local testing on the primary
node.

Here is what I did to get drbd to stop working:

# create a file on the primary drbd node
[root@nas1:/data1]# echo hello > /data1/testfile1

# unplug the network cable of the crossover link
# to simulate a network failure

# make changes to the primary file system while the secondary is not syncing
[root@nas1:/data1]# echo hello > /data1/testfile2

# reconnect the network cable

# after only a few bytes have changed on disk, the secondary goes Inconsistent
[root@nas2:~]# cat /proc/drbd
version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root@nas2, 2006-06-22 22:05:30
 0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
    ns:0 nr:1623114 dw:1623114 dr:0 al:0 bm:572 lo:0 pe:0 ua:0 ap:0


# some more disk activity on primary while secondary is Inconsistent
[root@nas1:/data1]# echo hello > /data1/testfile3
[root@nas1:/data1]# echo hello > /data1/testfile4
# the testfile4 echo already hangs and never returns to the prompt

# to get primary working again I disconnect the resources
# this causes primary to finish the testfile4 echo and return to the prompt
[root@nas2:~]# drbdadm disconnect all

# now I try the suggested workaround
[root@nas1:/data1]# perl -e '$x = "X" x (500*1024*1024)'
[root@nas2:~]# perl -e '$x = "X" x (500*1024*1024)'
[root@nas2:~]# drbdadm connect all
[root@nas2:~]# cat /proc/drbd
version: 0.7.18 (api:78/proto:74)
SVN Revision: 2176 build by root@nas2, 2006-06-22 22:05:30
 0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
    ns:0 nr:0 dw:1623114 dr:0 al:0 bm:572 lo:0 pe:0 ua:0 ap:0


But the secondary remains Inconsistent and still prints thousands of lines like
drbd0: [drbd0_receiver/9922] sock_sendmsg time expired, ko = 4294967281
to syslog.
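
One possible reading of that ko value: 4294967281 is 2^32 - 15, so if ko is a
signed 32-bit counter that is merely printed as unsigned (an assumption, not
checked against the 0.7.18 source), it has counted down past zero to -15. A
minimal sketch of the conversion, with to_signed32 being a made-up helper name
and a shell with 64-bit arithmetic assumed:

```shell
# Hypothetical helper: reinterpret an unsigned 32-bit value as signed.
# Assumes the shell does 64-bit integer arithmetic (bash, dash on 64-bit).
to_signed32() {
    v=$1
    # values with the top bit set (>= 2^31) wrap around to negative
    if [ "$v" -ge 2147483648 ]; then
        v=$((v - 4294967296))
    fi
    echo "$v"
}

to_signed32 4294967281   # prints -15
```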

After I rebooted both nodes, it was working again.

> some technical background:

I skipped the I/O system analysis for now. I don't think the disks are
causing the problems, because the failure can be reproduced with very
small changes to the filesystem, which shouldn't have any impact on
disk performance. And to be honest, I lack the experience with the tools
you suggested to know what I should be looking for anyway ;-)

I will try to get hold of some other hardware and will try to test it
with smaller drbd devices. But in the meantime any more thoughts on
this would be really appreciated.
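
For repeating the test on other hardware, the state fields could be pulled out
of /proc/drbd with something like the following. This is only a sketch: it
assumes the 0.7-style status line shown above (cs:, st: and ld: fields on the
device line), and the function name drbd_state is made up for illustration.

```shell
# Hypothetical helper: read /proc/drbd-style text on stdin and print
# "<device> <cs> <st> <ld>" for each device line, e.g. for
#  0: cs:WFBitMapT st:Secondary/Primary ld:Inconsistent
drbd_state() {
    awk '$1 ~ /^[0-9]+:$/ {
        cs = st = ld = "?"                 # "?" if a field is missing
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^cs:/) cs = substr($i, 4)
            if ($i ~ /^st:/) st = substr($i, 4)
            if ($i ~ /^ld:/) ld = substr($i, 4)
        }
        print $1, cs, st, ld
    }'
}

# usage on a node: drbd_state < /proc/drbd
```

Run in a loop with a timestamp on both nodes, this would show exactly when the
secondary changes state after the cable is reconnected.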

best regards,
Andreas
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user

