Folks,
kernel 2.2.13ac1, patched with ide.2.2.13.19991111.patch and
raid0145-19990824-2.2.11. I know this is no longer "state of the art", but it
was pretty solid in its day. Recently, we've had two events that took out the
entire raid5 array, and both followed the same pattern. Here's the sequence:
The drive loses DMA for some reason:
Jul 14 09:20:11 osmin kernel: hdi: timeout waiting for DMA
Jul 14 09:20:11 osmin kernel: hdi: irq timeout: status=0xd0 { Busy }
Jul 14 09:20:11 osmin kernel: hdi: DMA disabled
Jul 14 09:20:12 osmin kernel: ide4: reset: success
Further attempts to access the disk lead to:
Jul 14 09:22:25 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 14 09:22:25 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 14 09:22:25 osmin kernel: ide4: reset: success
Jul 14 09:27:32 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 14 09:27:32 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 14 09:27:32 osmin kernel: ide4: reset: success
This goes on for hours, with the drive still marked active in /proc/mdstat the
whole time. Finally, roughly fifteen hours later:
Jul 15 00:25:45 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 15 00:25:45 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 15 00:25:45 osmin kernel: ide4: reset: success
Jul 15 00:25:47 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 15 00:25:47 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 15 00:25:47 osmin kernel: ide4: reset: success
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 39:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdk1, disabling device. Operation continuing on 3 devices
Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 16:41: rw=0, want=635481100, limit=36630688
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdd1, disabling device. Operation continuing on 2 devices
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 22:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdg1, disabling device. Operation continuing on 1 devices
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 38:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdi1, disabling device. Operation continuing on 0 devices
Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198
followed by:
Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053926987
Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053730379
on and on forever, and the array is dead to the world.
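Incidentally, the numbers in the "attempt to access beyond end of device" lines
are internally consistent but wildly out of range, as if an md-device-scale
block address were being handed straight to the component partitions. A quick
check of the arithmetic (reading blocknr/want/limit as 1 KiB blocks and sector
as 512-byte sectors is my assumption):

```python
# Numbers copied from the kernel log above; unit reading is my assumption:
# blocknr/want/limit in 1 KiB blocks, sector in 512-byte sectors.
blocknr = 635481099   # offending block number
sector  = 1270962198  # the same address reported in sectors
want    = 635481100   # end of the requested range, in blocks
limit   = 33417184    # size of the hdi1/hdg1/hdk1 partitions (~34 GB)

print(sector == blocknr * 2)  # the two addresses agree (1 KiB = two 512-byte sectors)
print(want / limit)           # roughly 19x past the end of any one partition
```

So the address itself isn't garbage, it's just being applied at the wrong
scale, which is why all four members fail the same request at once.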
RAID has failed me here: I lost one disk, and then I lost them all. The very
thing I installed RAID to prevent instead led me to a larger catastrophe. Yes,
I can reboot and fsck the array, but files are missing (old files not recently
accessed) and there's repair work to be done. Not an ideal outcome.
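For what it's worth, /proc/mdstat showed all four members up the entire time
hdi was erroring. A quick parse of the status fields makes that concrete (a
sketch; the mdstat line below is a made-up sample in the same 2.2-era format,
not a capture from the failure, and the [n/m] [U_] reading is my
interpretation):

```python
# Parse the member-status fields out of an mdstat line. Sample data only.
import re

line = ("md1 : active raid5 hdk1[3] hdi1[2] hdg1[1] hdd1[0] "
        "100251552 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]")

m = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
total, working, flags = int(m.group(1)), int(m.group(2)), m.group(3)
# Every slot still shows "U" (up); md never flips one to "_" (failed),
# so no reconstruction or hot-spare kick-in ever happens.
print(working, "of", total, "members up:", flags)   # 4 of 4 members up: UUUU
```

If the md layer had marked hdi1 failed at the first hard error, the array
would have degraded gracefully instead of collapsing all at once.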
My question is this: do the diagnostics above point to a misconfiguration on
my part, or to a shortcoming in RAID's ability to cope with a drive that has
had DMA disabled?
-Darren