hi all
i'm trying to implement a high availability system and would like to share
my experiences. i also have some questions about md, of course.
my current setup is:
-Kernel 2.2.3 with raid0145-19990309-2.2.3/raidtools-19990309-0.90
-2x Asus PCI/I-486SPG3G (NCR53c810 onboard with SDMS-BIOS)
-2x AHA1542CF
-2x DCAS-34330 (4GB)
-2x IDE-disk with the linux-system
external termination        DCAS         external termination
          |                   |                    |
        1542<--------internal SCSI cable-------->1542
         NCR<--------internal SCSI cable-------->NCR
          |                   |                    |
 onboard termination        DCAS          onboard termination
what i have learned so far:
-the 1542-setup works fine - even if one card is not connected to a mainboard
or has no power. of course i'm using active terminators. i had to deactivate
"reset scsi bus at power on" and "host adapter bios" in the 1542-bios (with the
latter enabled, the bios scans for ID 0 forever if no such device is present).
-the NCR-setup doesn't survive power loss of one mainboard. i will buy active
terminators for the internal cable. changing the host-ID of the adapters is
no problem with linux if you modify drivers/scsi/53c7,8xx.c:
    host->this_id = 5; /* HERE */
    if (expected_id == -1 || host->this_id != expected_id)
        printk("scsi%d : using initiator ID %d\n", host->host_no,
            host->this_id);
my SDMS-BIOS uses ID 7 to scan the SCSI-bus, which causes a problem if both
computers are booting at the same time. i solved this by deactivating the
adapter in the BIOS - this makes no difference to linux.
-the SCSI error-detection and handling is bad. cutting power to one drive or
removing terminators often locks up the kernel when the scsi-modules are
loaded, or makes insmod of the scsi-modules take hours and leaves the system
unstable. i will have to install some watchdog-hardware or a facility to
hardware-reset or power off the computers via software. logging the
insmod-commands, so that broken adapters/scsi-disks can be avoided after the
next reboot, also seems a good idea.
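a minimal sketch of the software side of such a watchdog, assuming the kernel's /dev/watchdog interface (softdog or a hardware driver), where every write to the device postpones the reset. the names pet_watchdog and watchdog_loop are made up for illustration and i have not run this on the setup above:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write one byte to an already-open watchdog device to postpone the reset.
 * Returns 0 on success, -1 on error. (hypothetical helper) */
int pet_watchdog(int fd)
{
    assert(fd >= 0);
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}

/* Open the device at `path` (normally /dev/watchdog) and write to it every
 * `interval` seconds. Stops after `rounds` writes, or never if rounds < 0.
 * If this process hangs together with the kernel, the writes stop and the
 * watchdog is expected to reset the machine. (hypothetical helper) */
int watchdog_loop(const char *path, unsigned interval, int rounds)
{
    int done = 0;
    int fd = open(path, O_WRONLY);

    if (fd < 0) {
        perror(path);
        return -1;
    }
    while (rounds < 0 || done < rounds) {
        if (pet_watchdog(fd) != 0) {
            close(fd);
            return -1;
        }
        if (rounds >= 0)
            done++;
        sleep(interval);
    }
    /* Assumption: whether close() disarms the watchdog depends on the
     * driver and on kernel configuration. */
    close(fd);
    return 0;
}
```

calling watchdog_loop("/dev/watchdog", 10, -1) from a small daemon would be the intended use. the interval has to be shorter than the watchdog timeout.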
here are my complete results:
        power off for 1542-disk         power off for NCR-disk
---------------------------------------------------------------------------
insmod  OK                              OK
raid1   IO-errors with cp or reset      lockup
raid5   lockup after some time          OK or kernel-panic

        termination off for 1542-disk   termination off for NCR-disk
---------------------------------------------------------------------------
insmod  takes forever, unstable system  not tested
raid1   reset                           not tested
raid5   lockup                          not tested
lockups are tolerable because the other computer can detect them. undefined
states with IO-errors during filesystem operations, or an unstable system,
are very bad. any ideas how to fix this, or experience with other adapters?
at an early stage of my tests i had some cases where the filesystem couldn't
be repaired, but i think i did something wrong with the raid-setup (e.g.
using the autodetect feature :).
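as for detecting a lockup from the other computer: one way to sketch it is a heartbeat timeout. this assumes each machine records when it last heard from its peer (over serial line, network or whatever). the function name and the timeout are hypothetical, not something from my setup:

```c
#include <assert.h>
#include <time.h>

/* Return 1 if the peer should be treated as locked up, i.e. more than
 * `timeout_sec` seconds have passed since its last heartbeat; else 0.
 * (hypothetical helper, the transport for the heartbeat is not shown) */
int peer_locked_up(time_t last_heartbeat, time_t now, int timeout_sec)
{
    assert(timeout_sec > 0);
    return difftime(now, last_heartbeat) > (double)timeout_sec;
}
```

this only catches the clean lockup case. a peer that still sends heartbeats but returns IO-errors needs a different check.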
-md-specific stuff/questions
-are there any hardware-failures that could render the array or an
ext2-filesystem on it useless (excluding failure of 2 disks of course) ?
-i think there are hardware-failures after which an ext2-filesystem cannot be
repaired non-interactively (with e2fsck -f -y), e.g. an error while updating
the ext2-superblock. am i right ?
-creating a raid5-array with a chunk size of 32kb locks up the kernel.
-i had problems with raid-autodetection and with restarting arrays where
a disk is missing or device names have changed. it works better without this
feature in the kernel.
-sometimes the array has enough disks but doesn't start because the wrong
disks are marked as unusable. e.g. if i make a raid5 array on
sdb1/sdb2/sda1, power sda1 off and on while writing, reboot and swap the
order in which the scsi-controllers are insmod'ed (array now on
sda1/sda2/sdb1), sdb1 and sdb2 are marked unusable (does md really mean
sda1/sda2 here ?). perhaps because they are on the same disk ? won't i get
this error too if all raid5-partitions are on different disks and one device
drops out of the scsi-scan so that all other device names change ?
-md doesn't work without the persistent-superblock option (i.e. strictly using
the info in /etc/raidtab). given the errors above, this feature is needed.
cu,
brunni