Re: Failsafe Servers

Daniel D. Arrasjid Mon, 4 May 1998 23:09:24 +0200 (MET DST)
>From [EMAIL PROTECTED]:
> Currently we have 2 fileservers for user home and each server is connected
> with a deck of disks.  We are thinking about how to improve the
> reliability and availability of the AFS service.  For the disks, we can
> consider RAID to improve reliability and availability.  For the
> servers, we wonder if there is any way for us to implement a high
> availability solution.  Imagine that we can merge the disks of the 2 
> fileservers to a RAID and 2 hosts can be the backup of each other.  With 
> NFS, there is hardware and software solution for failsafe servers.  We 
> wonder if there is any way that we can implement it with AFS.  Specifically,
> if 1 server is down with hardware problem, is there any way that we can 
> specify the second server to be the primary server so that AFS service
> continues with little or no interruption?
> 
> Thanks in advance.
> 
> ---------------------------------------------------------------------------
> K. K. Tam                                   | Email: [EMAIL PROTECTED]       |
> Centre of Computing Services &              |                             | 
>           Telecommunications                |                             |
> Hong Kong University of Science & Technology| Tel: (852) 2358 6246        |
> Clear Water Bay                             |                             |
> Kowloon, Hong Kong                          | Fax: (852) 2358 0967      |
> ---------------------------------------------------------------------------

 
We're about to deploy DCE/DFS services for High Availability on Sun
systems.  The model is very similar to IBM's DFS HA solution.
Basically, you have two machines attached to the same storage system,
and all DCE/DFS-related files are located on that shared storage.
I imagine a similar solution could be implemented with AFS.

The primary machine is configured as the DFS server; the secondary
machine simply watches for the primary to fail (using a commercial
solution).  When failure is detected, the secondary machine changes
its IP address, imports the storage system, and starts up DCE and DFS
services.
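
As a rough illustration of the "watching" step (this is not the commercial
product, just a sketch), a heartbeat loop could look like the following.
PROBE_CMD, MAX_MISSES, INTERVAL and wait_for_failure are made-up names, and
the default probe assumes the Solaris ping syntax (host, then timeout in
seconds).

```shell
#!/bin/ksh
# Hypothetical heartbeat watcher; a sketch only, not the commercial HA product.
# PROBE_CMD, MAX_MISSES and INTERVAL are illustrative names, not real settings.

PROBE_CMD=${PROBE_CMD:-"/usr/sbin/ping ra 5"}  # assumed probe: Solaris ping, 5s timeout
MAX_MISSES=${MAX_MISSES:-3}                    # consecutive failed probes before giving up
INTERVAL=${INTERVAL:-10}                       # seconds to wait between probes

# Returns 0 once the probe has failed MAX_MISSES times in a row.
# A single successful probe resets the counter.
wait_for_failure() {
    misses=0
    while [ "$misses" -lt "$MAX_MISSES" ]; do
        if $PROBE_CMD >/dev/null 2>&1; then
            misses=0
        else
            misses=$((misses + 1))
        fi
        if [ "$misses" -lt "$MAX_MISSES" ]; then
            sleep "$INTERVAL"
        fi
    done
    return 0
}

# Example use on the secondary (commented out so the sketch has no side effects):
# wait_for_failure && /usr/local/bin/failover
```

Requiring several misses in a row keeps one dropped packet from triggering
a takeover.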

The DCE/DFS startup/recovery process takes about 10 minutes to complete.

Say you have 2 machines: machine "ra" is the DFS server, machine "aten"
is booted and running Solaris but is otherwise idle.
 
To manually test failover, we run a script on aten.  The script first
attempts to shut down DFS, DCE, and Solaris on ra.  Next it gives aten
ra's hostname, IP number, and MAC address.  Then it imports the disk
groups from ra's disk array (vxdg import) and does a volume recovery
(vxrecover).  Finally, it starts DCE and DFS.
 
To switch back to "ra" from "aten", we run the /usr/local/bin/failback script
on aten, followed by ra's own (different) /usr/local/bin/failback script on ra.
 
In production, the HA software would detect the failure of the primary
DFS server (ra) over a private connection between the two machines,
ensure that the primary is powered off (by remote control) and that it
will not attempt to restart, then run modified versions of the failover
script to change the IP number of aten, import the disk groups, do
volume recovery, and start DCE and DFS services.
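
The ordering matters: vxdg -C import clears the other host's import locks,
so if ra were still alive, both machines could end up writing to the same
disks.  A minimal sketch of that guard is below.  FENCE_CMD, TAKEOVER_CMD
and failover_if_fenced are hypothetical names; the real power-off mechanism
belongs to the HA software and is not shown here.

```shell
#!/bin/ksh
# Sketch only: refuse to take over unless the primary is confirmed powered off.
# FENCE_CMD and TAKEOVER_CMD are placeholders, not part of any real HA product.

FENCE_CMD=${FENCE_CMD:-false}                          # assumed: exits 0 only if ra is powered off
TAKEOVER_CMD=${TAKEOVER_CMD:-/usr/local/bin/failover}  # the takeover script from this post

# Runs the takeover only after fencing succeeds; otherwise returns 1
# without touching the shared disks.
failover_if_fenced() {
    if $FENCE_CMD; then
        $TAKEOVER_CMD
    else
        echo "fencing of primary failed; refusing to import disk groups" >&2
        return 1
    fi
}
```

The default FENCE_CMD of "false" is deliberate: until a real fencing
command is configured, the sketch never starts a takeover.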

The test script is included below.   
 
aten:/usr/local/bin/failover
 
#!/bin/ksh -x
# hostnames, ip/ether addresses changed to protect the innocent
 
#===================
# Halt ra
# Simulates failed DFS server detection and power off 
# done by HA software
#===================
/usr/bin/rsh ra /etc/init.d/dfs stop
/usr/bin/rsh ra /etc/init.d/dce stop
/usr/bin/rsh ra /sbin/umount /dce
/usr/bin/rsh ra /usr/sbin/vxdg deport phgroup01
/usr/bin/rsh ra /usr/sbin/vxdg deport phgroup02
/usr/bin/rsh ra halt
sleep 30
 
#===================
# Assume ra.dce's IP number and MAC address
#===================
/sbin/ifconfig hme0 inet X.X.X.X up
/sbin/ifconfig hme0 ether X:X:XX:XX:XX:XX
/usr/bin/hostname ra
 
#===================
# Preserve network settings
#===================
/bin/cp /etc/ra-hostname /etc/hostname.hme0
/bin/cp /etc/ra-hostname /etc/nodename
# Also see /etc/rc2.d/S72inetsvc
/bin/cp /etc/rc2.d/ra-S72inetsvc /etc/rc2.d/S72inetsvc
 
#===================
# Import disk groups
#===================
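# -C clears the import locks that ra left on the disk groups
/usr/sbin/vxdg -C import phgroup01
/usr/sbin/vxdg -C import phgroup02
# -s starts the disabled volumes; -b runs the recovery in the background
/usr/sbin/vxrecover -g phgroup01 -sb
/usr/sbin/vxrecover -g phgroup02 -sb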
/sbin/mount /dev/vx/dsk/phgroup02/phvol02 /dce
 
#===================
# Start DCE and DFS
#===================
/etc/init.d/dce start
/etc/init.d/dfs start
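
The test script above keeps going even if a step fails.  For anything
beyond a manual test, one option is to wrap each step so the script stops
at the first error.  The run_step helper below is a hypothetical addition,
not part of the script we actually run.

```shell
#!/bin/ksh
# Sketch of a fail-fast wrapper; run_step is an illustrative helper name.

# Usage: run_step "description" command [args...]
# Prints the description, runs the command, and exits the script with the
# command's exit status if it fails.
run_step() {
    desc=$1
    shift
    echo "==> $desc"
    "$@" || {
        rc=$?
        echo "FAILED: $desc (exit $rc)" >&2
        exit "$rc"
    }
}

# Example use with commands from the failover script (commented out here):
# run_step "import phgroup01" /usr/sbin/vxdg -C import phgroup01
# run_step "mount /dce"       /sbin/mount /dev/vx/dsk/phgroup02/phvol02 /dce
```

Stopping early means DCE and DFS are never started on top of a disk group
that failed to import or a /dce that failed to mount.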
 



-- 
Daniel D. Arrasjid                   Computing and Information Technology
Voice: (716) 645-6153                State University of New York at Buffalo
Fax:   (716) 645-5972                301 Computing Center, Buffalo, NY 14260
E-Mail: [EMAIL PROTECTED]      WWW: http://www.acsu.buffalo.edu/~daniel
PGP public key: http://www.acsu.buffalo.edu/~daniel/key.html