Stephen C. Tweedie writes:
> Hi,
> 
> On Sun, 18 Oct 1998 15:55:35 +0200 (CEST), MOLNAR Ingo
> <[EMAIL PROTECTED]> said:
> 
> > On Sun, 18 Oct 1998, Tod Detre wrote:
> 
> >> in 2.1 kernels you can make nfs a block device.  raid can work with block
> >> devices so if you raid5 several nfs computers one can go down, but you
> >> still can go on. 
> 
> > you probably want to use Stephen Tweedie's NBD (Network Block Device),
> 
> Heh, thanks, but the credit is Pavel Machek's.  I've just been testing
> and bug-fixing it.
> 
> > which works over TCP and is much more reliable and works over bigger
> > distance and larger dropped packets range. You can even have 5 disks on 5
> > continents put together into a big RAID5 array. (meant to survive a
> > meteorite up to the size of a few 10 miles ;) and you can loopback it
> > through a crypt^H^H^H^H^Hcompression module too before sending it out to
> > the net. 
> 
> Of course, you'll need to manually reconstruct the raid array as
> appropriate, and you don't get raid autostart on a networked block
> device either.  However, it ought to be fun to watch, and I'm hoping we
> can integrate this method of operation into some of the clustering
> technology now appearing on Linux to do failover of NFS services if one
> of the networked raid hosts dies.  Just remount the raid on another
> machine using the surviving networked disks, remount ext2fs and migrate
> the NFS server's IP address: voila!

There's a way which should give better performance in the general case
that I think I've mentioned on this mailing list before. It avoids the
overhead of a synchronous NBD since when you're migrating a disk to a
new system, there's no constraint that the remote system have up to
date data all the time. It's a combination of a little kernel driver
called breq and a simple user-mode program. The basic idea is to add
a few lines to make_request in ll_rw_blk.c in "case WRITE". When breq
is turned on for a particular device (done by an ioctl on /dev/breq
which appears as a character device to user-land), the block number of
the request is simply written to a 4K ring buffer. That's the only
kernel patch needed. The breq device driver module sucks out the data
from the ring buffer and feeds it to the reader.
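
To make the ring buffer idea concrete, here is a minimal user-space
sketch of a 4K ring holding 32-bit block numbers (1024 slots). The
names blk_ring, ring_put and ring_get are made up for illustration
and are not the actual breq code. It is single-threaded on purpose:
it shows only the data structure, not the locking a kernel version
would need, and what to do when the ring fills up is left to the
caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_BYTES 4096
#define RING_SLOTS (RING_BYTES / sizeof(uint32_t))  /* 1024 block numbers */

/* Hypothetical layout: head and tail are free-running counters. */
struct blk_ring {
    uint32_t slot[RING_SLOTS];
    unsigned head;  /* total entries ever written (producer side) */
    unsigned tail;  /* total entries ever read (consumer side) */
};

static void ring_init(struct blk_ring *r)
{
    assert(r != NULL);
    r->head = r->tail = 0;
}

/* Record the block number of a write request.
 * Returns 0 on success, -1 if the ring is full. */
static int ring_put(struct blk_ring *r, uint32_t blocknum)
{
    if (r->head - r->tail == RING_SLOTS)
        return -1;
    r->slot[r->head % RING_SLOTS] = blocknum;
    r->head++;
    return 0;
}

/* Fetch the oldest recorded block number.
 * Returns 0 on success, -1 if the ring is empty. */
static int ring_get(struct blk_ring *r, uint32_t *blocknum)
{
    if (r->head == r->tail)
        return -1;
    *blocknum = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return 0;
}
```

Because the slot count is a power of two, the counters can wrap
around without the head/tail arithmetic going wrong.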

To do a filesystem migration, there's a bmigrate user-land program
which effectively has two independent threads (actually it uses
select() but it's easier to think of as threads). You start with a
bitmap with one bit for each block on the device you're migrating,
setting the bitmap to all 1s. You make a TCP connection to a daemon
on the new system (described below). One thread of bmigrate does

    blocknum_t n; /* 32 bits */
    /* stop if breq ever returns a short read or an error */
    while (read(breq_fd, &n, sizeof(n)) == sizeof(n)) {
        bitmap[n] = 1; /* mark block n dirty */
    }

The other thread does

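    /* assumes find_first_set_bit() returns -1 when no bit is set,
       so that block 0 is not mistaken for "nothing left" */
    while ((n = find_first_set_bit(bitmap)) >= 0) {
        struct { int n; char data[512]; } binfo;
        bitmap[n] = 0; /* mark it clean before reading, so a racing write re-dirties it */
        lseek(raw_device_fd, (off_t)n * 512, SEEK_SET); /* go to block n */
        read(raw_device_fd, binfo.data, sizeof(binfo.data));
        binfo.n = n;
        write(remote_socket, &binfo, sizeof(binfo));
    }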
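
The pseudocode above treats the bitmap as an array indexed by block
number. As a rough sketch of what "one bit for each block" could look
like, here are some helpers; struct bitmap and the bitmap_* names are
hypothetical, not taken from bmigrate. The search returns -1 when
nothing is dirty, so block 0 can be told apart from "nothing left".

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dirty-block bitmap: one bit per block. */
struct bitmap {
    unsigned char *bits;
    long nblocks;
};

/* Allocate the bitmap with every block marked dirty (all 1s).
 * Returns 0 on success, -1 if allocation fails. */
static int bitmap_init_all_dirty(struct bitmap *bm, long nblocks)
{
    size_t nbytes = ((size_t)nblocks + CHAR_BIT - 1) / CHAR_BIT;

    bm->bits = malloc(nbytes);
    if (bm->bits == NULL)
        return -1;
    memset(bm->bits, 0xff, nbytes);
    bm->nblocks = nblocks;
    return 0;
}

/* Mark block n dirty. */
static void bitmap_set(struct bitmap *bm, long n)
{
    assert(n >= 0 && n < bm->nblocks);
    bm->bits[n / CHAR_BIT] |= 1u << (n % CHAR_BIT);
}

/* Mark block n clean. */
static void bitmap_clear(struct bitmap *bm, long n)
{
    assert(n >= 0 && n < bm->nblocks);
    bm->bits[n / CHAR_BIT] &= ~(1u << (n % CHAR_BIT));
}

/* Return the lowest dirty block number, or -1 if none is dirty. */
static long bitmap_find_first_set(const struct bitmap *bm)
{
    long n;

    for (n = 0; n < bm->nblocks; n++) {
        if (bm->bits[n / CHAR_BIT] & (1u << (n % CHAR_BIT)))
            return n;
    }
    return -1;
}
```

The linear scan from block 0 on every call is the simplest thing that
works; a real tool would probably remember where the last scan stopped.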

The daemon on the other end of the connection just does

    /* sketch: assumes each read() returns one whole binfo record */
    while (read(client_socket, &binfo, sizeof(binfo)) == sizeof(binfo)) {
        lseek(raw_device_fd, (off_t)binfo.n * 512, SEEK_SET);
        write(raw_device_fd, binfo.data, 512);
    }
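
One detail the daemon loop glosses over: read() on a TCP socket may
return fewer bytes than asked for, so a record can arrive in pieces.
A helper along these lines (read_full is a made-up name, not part of
bmigrate) keeps reading until a whole record is in the buffer:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Read exactly len bytes from fd into buf.
 * Returns 1 if a full record was read,
 *         0 on end-of-file before any byte of the record,
 *        -1 on error or if the stream ends mid-record. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    assert(buf != NULL);
    while (got < len) {
        ssize_t r = read(fd, p + got, len - got);

        if (r == 0)
            return got == 0 ? 0 : -1;
        if (r < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;
        }
        got += (size_t)r;
    }
    return 1;
}
```

The daemon would then loop while read_full(client_socket, &binfo,
sizeof(binfo)) returns 1. Sending the struct raw also assumes both
machines agree on byte order and struct padding.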

This is all completely asynchronous to the migrating filesystem so
it's not as slow as a network block device. Now, gradually the
bitmap gets cleared as the migrater writes across the data and
catches up with ongoing write activity. Eventually, there are only
a "few" bits set. At that time, you take down the RAID device, let
the migrater finish sending the last few blocks, then bring up the
new system on the same IP number with its newly migrated data.

The code for bmigrate is sitting on my IPC at home and I haven't
quite had the time to do the breq thing properly yet. I've not quite
figured out what context make_request runs in and how to synchronise
writing to the ring buffer with the ioctl code to shut it off.
Does make_request get called from interrupts or bottom halves?
What's the new-fangled SMP-safe way to do such locking in a way that
make_request doesn't have to get a slow lock every time it wants to
write data?

--Malcolm

-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
