Re: failsafe scripts

W. Phillip Moore Mon, 28 Oct 1996 17:09:05 GMT
>>>>> "norton" == norton  <[EMAIL PROTECTED]> writes:

norton> There are several vendors now offering "failsafe"
norton> capabilities.  For instance, having a dual ported dual
norton> redundant RAID controller, so that two systems can access
norton> (different) RAID sets on the same set of RAID controllers, and
norton> if one server fails, the RAID sets can be mounted on the other
norton> server and ip addresses get pushed around, etc. etc. and life
norton> goes on without users realizing a system has failed.  (Well,
norton> that's the theory at least).

This is impressive technology, but I think there are some shortcomings
in your proposed configuration.

norton> In theory, you would do something like (under AFS 3.4):

norton> o mount the RAIDs from the failed machine on the good machine
norton> o restart bosserver and fs processes
norton> o run vos syncvldb <good server> <just mounted partitions>
norton> o may need to run vos changeaddr <failed ip> <good server ip>

norton> (Putting things back when the server is back online is almost,
norton> but not quite the reverse process).
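
For reference, the quoted steps can be written out as a dry-run plan.
This is only a sketch: the function name, host names, addresses, and
device paths are hypothetical, and the bos/vos flags shown should be
checked against the documentation for your AFS release before use.
Nothing is executed; the function only builds the command strings.

```python
def failover_plan(good_server, failed_ip, good_ip, mounts):
    """Return the commands for the quoted failover steps, without running them.

    mounts maps a block device (hypothetical names) to the /vicepXX
    mount point it should get on the surviving server.
    """
    plan = []
    # Step 1: mount the failed machine's RAID sets on the good machine.
    for device, vicep in sorted(mounts.items()):
        plan.append(f"mount {device} {vicep}")
    # Step 2: restart the BOS server, which restarts the processes it
    # controls (including fs) so the new partitions get attached.
    plan.append(f"bos restart -server {good_server} -bosserver")
    # Step 3: bring the VLDB in line with what is now on each partition.
    for vicep in sorted(mounts.values()):
        plan.append(f"vos syncvldb -server {good_server} -partition {vicep}")
    # Step 4 (the quoted text says this "may" be needed): repoint the
    # failed server's address in the VLDB.
    plan.append(f"vos changeaddr -oldaddr {failed_ip} -newaddr {good_ip}")
    return plan


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in failover_plan("fs2", "10.0.0.1", "10.0.0.2",
                             {"/dev/sdb1": "/vicepb"}):
        print(cmd)
```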

You need to compare the time it takes to actually perform the above
operation (very non-trivial) with the amount of time it would take to
swap the bad CPU for a replacement.

fscking, salvaging, and attaching all those volumes will be a constant
in both possible configurations (CPU swap vs. alternate startup).
This alone can take, even with fast CPUs and disks, 30 minutes or
longer (my number is obviously bogus, as this depends on a *lot* of
variables, such as CPU and disk performance, amount of data, etc).

vos syncserv can take a long time to scan a large server with a lot of
data, perhaps another 20-30 minutes.
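
To make the comparison concrete, here is a back-of-the-envelope sketch.
The 30-minute recovery figure and the 20-30 minute VLDB scan (taken as
25) are the rough guesses from this message; the 20-minute CPU swap is
purely a hypothetical placeholder, as are the function names. Plug in
your own measurements.

```python
def outage_minutes(recovery, extra):
    """Total outage: the common fsck/salvage/attach time plus whatever
    is specific to the chosen approach (VLDB resync or hardware swap)."""
    return recovery + extra


def failover_is_faster(relocation, swap):
    """Recovery time is common to both approaches, so failover only
    wins when relocating the data takes less time than the swap."""
    return relocation < swap


RECOVERY = 30    # fsck + salvage + attach, rough guess from the post
RELOCATION = 25  # midpoint of the 20-30 minute VLDB scan estimate
SWAP = 20        # hypothetical time to swap in a spare CPU

failover = outage_minutes(RECOVERY, RELOCATION)
cpu_swap = outage_minutes(RECOVERY, SWAP)

if __name__ == "__main__":
    print("failover:", failover, "min; CPU swap:", cpu_swap, "min")
    print("failover faster?", failover_is_faster(RELOCATION, SWAP))
```

With these placeholder numbers neither path is close to transparent,
and failover comes out ahead only if the swap takes longer than the
relocation work.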

In either case, I don't believe this will be transparent to users.
Volumes *will* be inaccessible for a non-zero period of time, and
access to them will time out, applications will fail, etc.

Given the expense of the hardware, and the additional overhead of
actually performing the administrative work to make the cell know
about the new locations (which is also error-prone, and another
possible source of outages), my opinion is it's not worth the effort.

You are better off having spare hardware available, so you can swap
the bad CPU out and bring up the affected fileserver, rather than try
to relocate all the disks to a new server and bring the data into the
cell in a new location.

W. Phillip Moore                                        Phone: (212)-762-2433
Information Technology Department                         FAX: (212)-762-1009
Morgan Stanley and Co.                                     E-mail: [EMAIL PROTECTED]
750 7th Ave, NY, NY 10019