My impression is that there is quite a range of Linux experience 
floating around PLUG, so I offer the following tale of goodness and 
world domination for those who haven't yet played with Linux's 
kernel-based RAID functionality.

At $WORK, I specify all servers to ship with an even number of hard 
drives. Most arrive with two drives, but some come with more.

On a typical two-drive server, I set up four partitions on each 
drive:

  * swap   -- a couple gigs
  * /boot  -- 300 MB
  * /      -- 20 to 40 GB
  * /srv   -- [remainder]

We mount /home from a central filestore, so I don't need any local 
space for home directories.

I create identical partition tables on both hard drives and mirror 
them using software RAID-1. I'll note that this is NOT a surrogate for 
backups; it's merely protection against drive failure.
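
For reference, building mirrors like these looks roughly like the 
following. This is a sketch rather than the exact commands from that 
box; the md-to-partition mapping matches the arrays you'll see below, 
but everything else is generic:

   # mirror each partition pair as a RAID-1 metadevice
   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
   mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
   mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
   mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4

   # record the arrays so they assemble by UUID at boot; the config
   # file lives at /etc/mdadm/mdadm.conf on Debian-ish systems
   mdadm --detail --scan >> /etc/mdadm.conf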

I also set up smartmontools to test and monitor the drives fairly 
aggressively, sending me an e-mail on error conditions:

   # /etc/smartd.conf
   /dev/sda -a -d ata -m me@domain -o on -s (S/../.././01|L/../../6/02)
   /dev/sdb -a -d ata -m me@domain -o on -s (S/../.././01|L/../../6/02)
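
Beyond the scheduled tests, it's easy to poke a suspect drive by hand 
with smartctl; a quick sketch:

   # kick off a short self-test, then (a few minutes later) check
   # overall health plus the error and self-test logs
   smartctl -t short /dev/sda
   smartctl -H -l error -l selftest /dev/sda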

One of our servers recently started throwing drive errors. Both smartd 
AND mdadm (the RAID userspace utility) let me know that /dev/sda was 
having problems. In my experience, that means I need to order a new 
drive pronto.
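
If you want to see the complaint for yourself, the quick checks look 
something like this (md0 here is just an example):

   # a failed member shows up flagged (F) in /proc/mdstat, and the
   # mirror drops from [UU] to [U_] or [_U]
   cat /proc/mdstat

   # per-array detail, including which member is faulty
   mdadm --detail /dev/md0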

The recent Thai floods have elevated drive prices, but I was able to 
find an appropriate replacement drive at newegg.com, so I placed the 
order.

Then I told the system to break the RAID mirrors by failing and 
removing the partitions on /dev/sda from the various metadevices:

   mdadm --manage /dev/md0 --fail /dev/sda1
   mdadm --manage /dev/md0 --remove /dev/sda1

   mdadm --manage /dev/md1 --fail /dev/sda3
   mdadm --manage /dev/md1 --remove /dev/sda3

   mdadm --manage /dev/md2 --fail /dev/sda2
   mdadm --manage /dev/md2 --remove /dev/sda2

   mdadm --manage /dev/md3 --fail /dev/sda4
   mdadm --manage /dev/md3 --remove /dev/sda4

I noted in dmesg that /dev/sda was on scsi_host 0:

   sd 0:0:0:0: Attached scsi disk sda
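
Sysfs tells the same story; the device link for sda resolves through 
host0:

   # the resolved path includes .../host0/..., matching dmesg
   readlink -f /sys/block/sda/device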

The drive arrived a couple days later. Since the machine in question 
is a rack-mount server, I was able to remove the failed drive by just 
sliding it out of the front of the case.
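
If you'd rather detach the device cleanly before yanking it, the 
kernel has a knob for that; a sketch, assuming the dying disk is 
still visible as sda:

   # tell the kernel to forget the device before pulling it
   echo 1 > /sys/block/sda/device/delete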

I slid the replacement drive into the bay. And then the fun part:

   # verify that /dev/sda is currently unreadable
   hdparm -I /dev/sda

   # ask the kernel to rescan scsi_host 0
   echo "0 0 0" > /sys/class/scsi_host/host0/scan

   # see if the drive responds to roll call after scan
   hdparm -I /dev/sda

   # verify the new drive has no partition table
   fdisk -l /dev/sda

   # copy partition table from sdb to sda
   sfdisk -d /dev/sdb | sfdisk /dev/sda

   # add sda's partitions to the RAID sets
   mdadm --manage /dev/md0 --add /dev/sda1
   mdadm --manage /dev/md1 --add /dev/sda3
   mdadm --manage /dev/md2 --add /dev/sda2
   mdadm --manage /dev/md3 --add /dev/sda4
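
While the new members rebuild, progress, speed, and an ETA are all 
visible in /proc/mdstat:

   # keep an eye on the resync; the rebuild rate can be tuned via
   # /proc/sys/dev/raid/speed_limit_min and speed_limit_max
   watch -n 30 cat /proc/mdstat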

Once the partitions were all synchronized (about 4 hours later), I 
installed grub on sda's master boot record to make sure that I could 
boot from that drive should sdb ever fail.

   grub> find /grub/stage1
   find /grub/stage1
    (hd0,0)
    (hd1,0)
   grub> root (hd0,0)
   root (hd0,0)
    Filesystem type is ext2fs, partition type 0xfd
   grub> setup (hd0)
   setup (hd0)
    Checking if "/boot/grub/stage1" exists... no
    Checking if "/grub/stage1" exists... yes
    Checking if "/grub/stage2" exists... yes
    Checking if "/grub/e2fs_stage1_5" exists... yes
    Running "embed /grub/e2fs_stage1_5 (hd0)"...  15 sectors are embedded. 
succeeded
    Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/grub/stage2 
/grub/grub.conf"... succeeded
   Done.

Et voila!

The system never had to come down, and the only real inconvenience 
was the I/O load during the resync.

-- 
Paul Heinlein <> [email protected] <> http://www.madboa.com/