My impression is that there is quite a range of Linux experience
floating around PLUG, so I offer the following tale of goodness and
world domination for those who haven't yet played with Linux'
kernel-based RAID functionality.
At $WORK, I specify all servers to ship with an even number of hard
drives. Most arrive with two drives, but some come with more.
On the typical two-drive server, I set up four partitions on each
drive:
* swap -- a couple gigs
* /boot -- 300 MB
* / -- 20 to 40 GB
* /srv -- [remainder]
We mount /home from a central filestore, so I don't need any local
space for home directories.
I create identical partition tables on both hard drives and mirror
them using software RAID-1. I'll note that this is NOT a surrogate for
backups; it's merely protection against drive failure.
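For the record, the initial setup is nothing exotic. The mirrors can
be built with something along these lines -- a sketch, not a transcript
from this box; adjust the metadevice names and the config file path
(/etc/mdadm.conf vs. /etc/mdadm/mdadm.conf) for your distro:
# build four RAID-1 metadevices from matching partitions on sda and sdb
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
# record the arrays so they're assembled automatically at boot
mdadm --detail --scan >> /etc/mdadm.conf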
I also set up smartmontools to test and monitor the drives fairly
aggressively (a short self-test every day at 01:00 and a long self-test
every Saturday at 02:00), sending me an e-mail on error conditions:
# /etc/smartd.conf
/dev/sda -a -d ata -m me@domain -o on -s (S/../.././01|L/../../6/02)
/dev/sdb -a -d ata -m me@domain -o on -s (S/../.././01|L/../../6/02)
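If you'd rather kick off a test by hand than wait for the schedule,
smartctl does the same job interactively (add -d ata if your setup
needs it, as in the smartd.conf above):
# start a short self-test on the first drive
smartctl -t short /dev/sda
# review the self-test log and the rest of the SMART data afterward
smartctl -a /dev/sda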
One of our servers recently started throwing drive errors. Both smartd
AND mdadm (the RAID userspace utility) let me know that /dev/sda was
having problems. In my experience, that means I need to order a new
drive pronto.
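If you want to see the damage for yourself, the kernel's RAID status
is easy to check; these are generic commands, not output from this
particular box:
# quick overview of every metadevice, with (F) marking failed members
cat /proc/mdstat
# more detail on a single array, including which member is faulty
mdadm --detail /dev/md1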
The recent Thai floods have elevated drive prices, but I was able to
find an appropriate replacement drive at newegg.com, so I placed the
order.
Then I told the system to break the RAID mirrors by failing and
removing the partitions on /dev/sda from the various metadevices:
mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md1 --fail /dev/sda3
mdadm --manage /dev/md1 --remove /dev/sda3
mdadm --manage /dev/md2 --fail /dev/sda2
mdadm --manage /dev/md2 --remove /dev/sda2
mdadm --manage /dev/md3 --fail /dev/sda4
mdadm --manage /dev/md3 --remove /dev/sda4
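As an aside, mdadm will take --fail and --remove in a single
invocation, so the same teardown can be looped if you're feeling
terse; the metadevice/partition pairs below are the same ones as
above:
for pair in md0:sda1 md1:sda3 md2:sda2 md3:sda4; do
    mdadm --manage /dev/${pair%:*} --fail /dev/${pair#*:} --remove /dev/${pair#*:}
done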
I noted in dmesg that /dev/sda was on scsi_host 0:
sd 0:0:0:0: Attached scsi disk sda
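If dmesg has already scrolled off, sysfs will tell you the same thing;
the block device's symlink runs back through the host in the SCSI
topology:
# the device link for sda passes through host0
readlink -f /sys/block/sda/device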
The drive arrived a couple days later. Since the machine in question
is a rack-mount server, I was able to remove the failed drive by just
sliding it out of the front of the case.
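One optional step on hot-swap hardware, mentioned here for
completeness: you can ask the kernel to forget the dead device before
pulling it, so nothing is left holding a reference:
# tell the kernel to release the failed device prior to removal
echo 1 > /sys/block/sda/device/delete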
I slid the replacement drive into the bay. And then the fun part:
# verify that /dev/sda is currently unreadable
hdparm -I /dev/sda
# ask the kernel to rescan scsi_host 0
echo "0 0 0" > /sys/class/scsi_host/host0/scan
# see if the drive responds to roll call after scan
hdparm -I /dev/sda
# verify the new drive has no partition table
fdisk -l /dev/sda
# copy partition table from sdb to sda
sfdisk -d /dev/sdb | sfdisk /dev/sda
# add sda's partitions to the RAID sets
mdadm --manage /dev/md0 --add /dev/sda1
mdadm --manage /dev/md1 --add /dev/sda3
mdadm --manage /dev/md2 --add /dev/sda2
mdadm --manage /dev/md3 --add /dev/sda4
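The kernel starts rebuilding each mirror as soon as its partition is
added, and the progress (plus an ETA) shows up in /proc/mdstat:
# check rebuild progress and the estimated time to completion
cat /proc/mdstat
# or keep an eye on it as it goes
watch -n 5 cat /proc/mdstat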
Once the partitions were all synchronized (about 4 hours later), I
installed grub on sda's master boot record to make sure that I could
boot from that drive should sdb ever fail.
grub> find /grub/stage1
find /grub/stage1
(hd0,0)
(hd1,0)
grub> root (hd0,0)
root (hd0,0)
Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd0)
setup (hd0)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd0)"... 15 sectors are embedded.
succeeded
Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/grub/stage2
/grub/grub.conf"... succeeded
Done.
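For what it's worth, on distros that have moved to GRUB 2 the rough
equivalent of that whole interactive session is a single command; I'll
offer it as a sketch rather than a recipe:
# GRUB 2: reinstall the boot loader to the new drive's MBR
grub-install /dev/sda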
Et voila!
The system never had to come down, and it was never terribly
inconvenienced except for the I/O load during the resync.
--
Paul Heinlein <> [email protected] <> http://www.madboa.com/