On Saturday July 15, [EMAIL PROTECTED] wrote:
>
> Folks,
>
> kernel 2.2.13ac1, patched with ide.2.2.13.19991111.patch and
> raid0145-19990824-2.2.11. I know this is no longer "state of the art", but it
> was pretty solid in its day. Recently, we've had 2 events which took out the
> entire raid5 array, both followed the same pattern. Here's the sequence:
... stuff deleted
> Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
> Jul 15 00:26:06 osmin kernel: 39:01: rw=0, want=635481100, limit=33417184
> Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
> Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdk1, disabling device. Operation continuing on 3 devices
> Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198
> Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
> Jul 15 00:26:06 osmin kernel: 16:41: rw=0, want=635481100, limit=36630688
> Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
> Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdd1, disabling device. Operation continuing on 2 devices
> Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
> Jul 15 00:26:06 osmin kernel: 22:01: rw=0, want=635481100, limit=33417184
> Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
> Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdg1, disabling device. Operation continuing on 1 devices
> Jul 15 00:26:06 osmin kernel: attempt to access beyond end of device
> Jul 15 00:26:06 osmin kernel: 38:01: rw=0, want=635481100, limit=33417184
> Jul 15 00:26:06 osmin kernel: dev 09:01 blksize=4096 blocknr=635481099 sector=1270962198 size=1024 count=1
> Jul 15 00:26:06 osmin kernel: raid5: Disk failure on hdi1, disabling device. Operation continuing on 0 devices
> Jul 15 00:26:06 osmin kernel: raid5: restarting stripe 1270962198
>
> followed by
>
> Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053926987
> Jul 15 00:26:06 osmin kernel: raid5: md1: unrecoverable I/O error for block 4053730379
>
> on and on forever, and the array is dead to the world.
The "attempt to access beyond end of device" is (I believe) an ext2
bug which was fixed shortly after 2.2.13.
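
To put numbers on how far out of range that request was, here is a small
Python sketch using the values from the quoted log. It assumes that the
2.2 block layer reports want as (sector + count) >> 1, with count in
512-byte sectors, and that want and limit are both in 1K units. The
function name is made up for illustration and is not the kernel's.

```python
SECTOR_BYTES = 512  # assumed unit of the "sector=" field in the log


def wanted_kib(sector, size_bytes):
    """End of the request in 1K units (assumed formula: (sector + count) >> 1)."""
    count = size_bytes // SECTOR_BYTES  # request length in sectors
    return (sector + count) >> 1


# Values copied from the first failure in the log (hdk1).
want = wanted_kib(sector=1270962198, size_bytes=1024)
limit = 33417184

print(want)          # 635481100, the want= value in the log
print(want > limit)  # True: the request ends beyond the device
print(want / limit)  # ratio of requested address to partition size
```

If those assumptions hold, the request is for an address roughly 19
times the size of the partition, which points at a corrupt block number
rather than at anything wrong with the disc.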
Unfortunately RAID5 doesn't check the block address before passing the
request down to the lower level. It gets an error back, assumes the
error means the disc is dead, fails the disc and retries on another
disc, which also fails because the problem isn't the disc, it is the
address.
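
The cascade can be sketched with a toy model in Python. This is not the
raid5 code: the classes and function names are hypothetical, the same
address is simply handed to each member (as the log suggests), and the
"fix" shown is only one plausible shape of an address check, not the
actual 2.4 change. Disc names and limits are taken from the quoted log.

```python
class Disk:
    def __init__(self, name, limit_kib):
        self.name = name
        self.limit_kib = limit_kib
        self.failed = False

    def read(self, want_kib):
        # The lower level only reports success or failure, so an
        # out-of-range address looks the same as a dead disc.
        return want_kib <= self.limit_kib


def make_disks():
    # Members and limits as they appear in the log.
    return [Disk("hdk1", 33417184), Disk("hdd1", 36630688),
            Disk("hdg1", 33417184), Disk("hdi1", 33417184)]


def read_without_check(disks, want_kib):
    """Treat every error as a disc failure and retry on the next disc."""
    for disk in disks:
        if disk.failed:
            continue
        if disk.read(want_kib):
            return True
        disk.failed = True  # the disc is blamed for a bad address
    return False


def read_with_check(disks, want_kib):
    """Reject an impossible address before any disc is touched."""
    working = [d for d in disks if not d.failed]
    if not working:
        return False
    if want_kib > min(d.limit_kib for d in working):
        return False  # bad request: fail the I/O, keep the discs
    return read_without_check(disks, want_kib)


bad_address = 635481100  # want= from the log

disks = make_disks()
read_without_check(disks, bad_address)
print([d.name for d in disks if d.failed])  # every member gets failed

disks = make_disks()
read_with_check(disks, bad_address)
print([d.name for d in disks if d.failed])  # no member gets failed
```

In both cases the read itself fails, which is correct since the address
is nonsense. The difference is that with the check the array survives
the bad request.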
This is fixed in 2.4 (as of about a week ago) and would be fairly easy
to fix in 2.2...
But I suggest you go for a slightly more recent kernel (says he who is
still using 2.2.13 himself :-()
>
> Raid has failed me here. I lost one disk, I lost them all. The reason I
> installed RAID simply led me to a larger catastrophe. Why? Yes, I can reboot
> and fsck the array, but files are missing (old files not recently accessed)
> and there's repairing to be done. Not an ideal solution.
>
> My question is this: do the diagnostics above point to a misconfig on my part,
> or is this a shortcoming in Raid's ability to cope with a drive with DMA
> disabled?
In summary:
- The only mis-config you made was not to use a new enough kernel.
- Yes. There is a shortcoming in the raid code.
NeilBrown
>
> -Darren
>