Pete French presumably uttered the following on 07/21/08 07:08:
The *big* issue I have right now is dealing with the slave machine going
down. Once the master no longer has a connection to the ggated devices,
all processes trying to use the device hang in D status. I have tried
pkill'ing ggatec to no avail and ggatec destroy returns a message of
gctl being busy. Trying to ggatec destroy -f panics the machine.

Oddly enough, this was the issue I had with iscsi which made me move
to using ggated instead. On our machines I use '-t 10' as an argument to
ggatec, and this makes it time out once the connection has been down for
a certain amount of time. I am using gmirror on top, not ZFS, and this
handled the drive vanishing from the mirror quite happily. I haven't
tried it with ZFS, which may not like having the device suddenly disappear.

-pete.
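
For reference, a minimal sketch of the kind of invocation described above: a ggatec device created with a 10-second timeout. Only the '-t 10' argument comes from the message; the host name and device path are hypothetical placeholders. The function only prints the command so it can be checked before anything is run.

```sh
#!/bin/sh
# Sketch only. "slavehost" and "/dev/da1" are hypothetical placeholders,
# not values taken from this thread.

# Print a "ggatec create" command line with an I/O timeout.
#   $1 = timeout in seconds
#   $2 = host running ggated
#   $3 = path exported by ggated
ggatec_create_cmd() {
    echo "ggatec create -t $1 $2 $3"
}

# Example: inspect the command first, then pipe it to sh to run it.
# ggatec_create_cmd 10 slavehost /dev/da1
# ggatec_create_cmd 10 slavehost /dev/da1 | sh
```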

What I have found is that the master machine will lock up if the slave disappears during a large file transfer. I tested this by setting up a zpool mirror on the master using a ggatec device from the slave. Then I:

pkill'ed ggated on the slave machine.

dd if=/dev/zero of=/data1/testfile2 bs=16k count=8192   [128MB] on the master

The dd command finished, and /var/log/messages showed I/O errors on the slave drive, as expected. It also showed ggatec trying to reconnect every 10 seconds (ggatec was started with the -t 10 parameter).

Finally, ZFS marked the drive unavailable, which then allowed me to run ggatec destroy -u 0 without getting the "ioctl(/dev/ggctl): Device busy" error message. (By the way, ggatec destroy does not kill the "ggatec create" process that set up the device to begin with; I had to pkill ggatec to get that to stop. Bug?)

The above behavior would be acceptable for multi-machine mirroring as it would be scriptable.
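
As a rough sketch of what such a script might look like, assuming the sequence observed above (wait for ZFS to mark the device unavailable, then destroy the ggate unit and kill the leftover ggatec process). The pool name "data1", the device name "ggate0", and the helper function names are assumptions for illustration; the cleanup commands are the ones used in the test above.

```sh
#!/bin/sh
# Sketch only. Pool "data1" and device "ggate0" are assumed names;
# the helper functions are hypothetical, not part of any tool.

# Succeed if "zpool status" text on stdin lists device $1 as UNAVAIL.
device_unavail() {
    awk -v dev="$1" '$1 == dev && $2 == "UNAVAIL" { found = 1 }
                     END { exit !found }'
}

# Print the cleanup commands for ggate unit $1.
# Note: "pkill ggatec" kills every ggatec process, not just this unit's.
cleanup_cmds() {
    echo "ggatec destroy -u $1"
    echo "pkill ggatec"
}

# Example (commented out so the functions can be sourced safely):
# if zpool status data1 | device_unavail ggate0; then
#     cleanup_cmds 0 | sh
# fi
```

Printing the commands rather than running them directly makes it possible to review the cleanup steps before piping them to sh.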

The problem comes with large writes. I tried to repeat the above with

dd if=/dev/zero of=/data1/testfile2 bs=16k count=32768 [512MB]

which then locks up ZFS, and ultimately the system itself. It seems that once the write size/buffer is full, ZFS is unable to mark the slave drive as failed or unavailable, and the entire system becomes unresponsive (I cannot even ssh into it).

The bottom line is that without some type of "timeout" or "time to fail" (bad I/O to fail?), zpool + ggate[cd] seems to be an unworkable solution. This is actually a shame, as the recovery process of swapping from master to slave and back again was so much cleaner and faster than with gmirror.


Sven