Paul Robins wrote:

> Derek, if you're reading this, I've tried yanking a drive before and the
> system doesn't crash, but any disk access hangs. I wish I could spec a
> higher quality controller but I have a feeling it will be rejected
> outright.

Yanking a drive is very different from a hard disk crash.  Yanking a
drive is equivalent to a controller crash.  Do you really expect to be
removing disks while the machine is operating?

> If either of you could weigh in on AFS on top of DRBD I'd appreciate it.
> I'm not fully up on whether a second server with an identical filesystem
> could be made to take over a crashed AFS machine. I appreciate all the
> help so far and wish there was a way I could donate back.

There are probably many issues here:

(1) only one of the physical machines can be running AFS servers at a
    time if they are supposed to be accessing the same data.

(2) when the server that is running the AFS servers has a disk failure,
    you will need to detect this and signal one of the other machines
    with access to the shared file system to start AFS servers.

(3) when those AFS servers start, there will not have been a proper
    shutdown of the prior AFS servers from the perspective of the
    file system.  Therefore a volume salvage operation will have to
    be performed.  This normally takes a while, and I have no idea how
    much running it on a network distributed file system with a failed
    machine will increase the time necessary for the operation to
    complete.

(4) the clients are told to find the volumes on a server with a
    particular IP address.  Either the machine now running the AFS
    services will have to start using the IP address of the failed
    machine, which may very well confuse the network RAID operations,
    or the volume database will have to be modified to know about the
    new file server IP address and to alter the locations of all of
    the affected volumes.
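
To make the ordering of points (1) through (4) concrete, here is a rough Python sketch. It is a toy model only, not a failover tool: Node, Cell, and fail_over are hypothetical names invented for illustration, no real AFS or DRBD command is run, and the salvage is merely recorded as a step rather than performed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one physical machine."""
    name: str
    address: str
    disk_ok: bool = True
    serving: bool = False


@dataclass
class Cell:
    """Hypothetical stand-in for the volume location data:
    volume name -> address of the file server clients are sent to."""
    vldb: dict = field(default_factory=dict)


def fail_over(active, standby, cell, take_over_address):
    """Move service from active to standby; return the steps taken, in order."""
    steps = []

    # (2) a disk failure has to be detected before anything else happens
    if active.disk_ok:
        return steps
    steps.append("detected disk failure on " + active.name)

    # (1) only one machine may run the servers against the shared data
    active.serving = False
    steps.append("stopped servers on " + active.name)

    # (3) the old servers never shut down cleanly, so a salvage comes
    #     first (modeled as a recorded step; this is where the time goes)
    steps.append("salvaged volumes on " + standby.name)

    # (4) clients look for volumes at a particular address: either the
    #     standby takes that address over, or the volume locations change
    if take_over_address:
        standby.address = active.address
        steps.append(standby.name + " took over address " + active.address)
    else:
        for volume, address in list(cell.vldb.items()):
            if address == active.address:
                cell.vldb[volume] = standby.address
        steps.append("repointed volumes to " + standby.address)

    standby.serving = True
    steps.append("started servers on " + standby.name)
    return steps
```

Every step between the detection and the final start is time during which the affected volumes are not being served.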

This transition is going to result in an outage of some period of time
for the users of all of the affected volumes.  In the scenario I
described earlier, by contrast, there was no outage.

Note that if you have no budget, you are probably better off buying
fewer machines and better disks and RAID controllers.

Jeffrey Altman
