I don't understand why I'm getting such mysterious crashes on one of four
identical Linux SMP 2.2.14 servers running RAID5. The machine just goes down
without warning. Sometimes it manages to reboot; sometimes the RAID
superblocks get out of sync, and I've had to run mkraid --force to get them
consistent again.
I was convinced it was a hardware error, so I moved the disks (6 IBM SCSI
U2W LVD) over to a new machine with identical hardware: dual PIII, EPOX
EP-BXBS with onboard Adaptec Ultra-2 SCSI controller (AIC 7890), 1 GB RAM.
I've checked the disks with the SCSI BIOS utilities and found no errors.
Kernel 2.2.14 is patched with the "~mingo" patch
(http://www.redhat.com/~mingo), and I'm using raidtools 0.90.
Could ONE disk really cause such complete lockups again and again, while
running seemingly perfectly for several days between crashes? There are NO
warnings in the logs before the crashes. The crashes appear to be random and
don't seem to be tied to any particular time or load. (Several happened in
the middle of the night, when activity was at its lowest.)
N.B.: the three other identical machines, with equal load and usage, have
never had any trouble during months of intense use.
After a crash and reboot, these are the reports:
syslog is full of this:
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 30!
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 62!
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 94!
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 126!
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 31!
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 63!
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 95!
Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, dev md(9,0), block 127!
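
Since the log repeats this message many times, a summary may be easier to
read than the raw lines. Below is a minimal Python sketch, assuming the
messages have exactly the format shown above; the function name
summarize_set_blocksize and the sample text are illustrative only and not
part of any existing tool.

```python
import re
from collections import Counter

# Matches the kernel message format shown in the syslog excerpt.
# "\s+" before the block number also tolerates a line wrapped by a mailer.
PATTERN = re.compile(
    r"set_blocksize: b_count (\d+), dev md\((\d+),(\d+)\), block\s+(\d+)!"
)

def summarize_set_blocksize(log_text):
    """Return (Counter of messages per (major, minor), sorted block numbers)."""
    counts = Counter()
    blocks = set()
    for match in PATTERN.finditer(log_text):
        _b_count, major, minor, block = match.groups()
        counts[(int(major), int(minor))] += 1
        blocks.add(int(block))
    return counts, sorted(blocks)

if __name__ == "__main__":
    # Hypothetical sample built from two of the lines quoted in this post.
    sample = (
        "Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, "
        "dev md(9,0), block 30!\n"
        "Mar 4 21:31:33 www5 kernel: set_blocksize: b_count 1, "
        "dev md(9,0), block 62!\n"
    )
    counts, blocks = summarize_set_blocksize(sample)
    for (major, minor), n in counts.items():
        print(f"md({major},{minor}): {n} messages")
    print("blocks:", blocks)
```

Feeding it the whole syslog file instead of the sample would show which md
devices and block numbers the messages refer to.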
dmesg is full of these lines:
ll_rw_block: device 09:00: only 4096-char blocks implemented (1024)
ll_rw_block: device 09:00: only 4096-char blocks implemented (1024)
ll_rw_block: device 09:00: only 4096-char blocks implemented (1024)
ll_rw_block: device 09:00: only 4096-char blocks implemented (1024)
ll_rw_block: device 09:00: only 4096-char blocks implemented (1024)
Additionally, several EXT2FS errors are reported in the logs after the
crashes, but nothing at all before them.
My last resort is to copy everything over to a set of six brand new disks.
At that point, _all_ hardware will have been replaced.
If that fails, I'll be REALLY DESPERATE! My boss will start asking whether
this Linux stuff is really that reliable after all - I've been blaming this
on hardware for a couple of weeks now!
Fellow Linux-RAID users, please help me out with this!
/Johan Ekenberg