Folks,

kernel 2.2.13ac1, patched with ide.2.2.13.19991111.patch and 
raid0145-19990824-2.2.11. I know this is no longer "state of the art", but it 
was pretty solid in its day. Recently, we've had two events which took out the 
entire raid5 array, both following the same pattern. Here's the sequence:

The drive loses DMA for some reason:

Jul 14 09:20:11 osmin kernel: hdi: timeout waiting for DMA 
Jul 14 09:20:11 osmin kernel: hdi: irq timeout: status=0xd0 { Busy } 
Jul 14 09:20:11 osmin kernel: hdi: DMA disabled 
Jul 14 09:20:12 osmin kernel: ide4: reset: success 


Further attempts to access the disk lead to:

Jul 14 09:22:25 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 14 09:22:25 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 14 09:22:25 osmin kernel: ide4: reset: success
Jul 14 09:27:32 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 14 09:27:32 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 14 09:27:32 osmin kernel: ide4: reset: success

This goes on for hours and hours, and the drive is still marked active in 
mdstat. Finally, after many hours:

Jul 15 00:25:45 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 15 00:25:45 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 15 00:25:45 osmin kernel: ide4: reset: success
Jul 15 00:25:47 osmin kernel: hdi: write_intr error2: nr_sectors=1, stat=0x58
Jul 15 00:25:47 osmin kernel: hdi: write_intr: status=0x58 { DriveReady SeekComplete DataRequest }
Jul 15 00:25:47 osmin kernel: ide4: reset: success
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 39:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdk1, disabling device. Operation continuing on 3 devices
Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 16:41: rw=0, want=635481100, limit=36630688
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdd1, disabling device. Operation continuing on 2 devices
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 22:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdg1, disabling device. Operation continuing on 1 devices
Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
Jul 15 00:26:06 osmin kernel: 38:01: rw=0, want=635481100, limit=33417184
Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdi1, disabling device. Operation continuing on 0 devices
Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198
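For what it's worth, the numbers in those messages are internally consistent: with size=1024 each block is two 512-byte sectors, so the logged blocknr is just the sector halved, and the requested block lies far beyond the limit of every member device, which is presumably why each member gets kicked in turn when the request is retried against it. A quick sanity check of the arithmetic (all values taken from the log above):

```python
# Sanity-check the "attempt to access beyond end of device" numbers above.
sector = 1270962198            # 512-byte sector from the raid5 messages
blocknr = sector // 2          # size=1024 => 2 sectors per block
want = blocknr + 1             # want = blocknr + count, with count=1
# Per-member limits as logged for 39:01, 16:41, 22:01, 38:01:
limits = [33417184, 36630688, 33417184, 33417184]

print(blocknr, want)           # 635481099 635481100, matching the log
print(all(want > lim for lim in limits))  # True: past the end of every member
```

So the bogus block number alone is enough to fail all four members, no matter which one the stripe is retried on.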

followed by

Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053926987 
Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053730379 

on and on forever, and the array is dead to the world.

RAID has failed me here: I lost one disk, and I lost them all. The very thing I 
installed RAID to protect against has instead led to a larger catastrophe. Why? 
Yes, I can reboot and fsck the array, but files are missing (old files that 
hadn't been accessed recently) and there's repair work to be done. Not an ideal 
outcome.

My question is this: do the diagnostics above point to a misconfiguration on my 
part, or to a shortcoming in RAID's ability to cope with a drive whose DMA has 
been disabled?

-Darren

