Hi!

Neil Brown wrote:
> On Thursday June 14, [EMAIL PROTECTED] wrote:
> > 
> > Dear All
> > 
> > I've just had a disk (sdc) fail in my raid5 array (sdb sdc sdd),
>
> Great!  A real live hardware failure!  It is always more satisfying to
> watch one of those than to have to simulate them all the time!!
> Unless of course they are fatal... not the case here it seems.
>
Well, here comes a _real_ fatal one...
>
> > and I noticed that it triggered a bug (?) in raid5.c, line 523:
> > 
> > kernel: md: bug in file raid5.c, line 523
> > 
> > Is it OK to ignore this ?
>
> Yep, it appears benign.  That MD_BUG will be hit whenever you have two
> requests to the one device fail, which would be pretty common with a
> real hardware failure, and not at all common with simulated failures.
> I'll put it on my todo list to have a closer look and send some sort
> of patch to Linus.

One of our customers had a RAID5 crash this week which led
to the loss of important data!

Here's the story:

The fileserver that failed had been running for about a year without
any real problems. It is located in an air-conditioned room in a 19"
rack, together with some other servers. The motherboard is an ASUS P2B-S,
with 4 IBM DPSS-318350N disks attached to the onboard Adaptec SCSI
controller.

The system runs Linux-2.2.17 with the v0.90.0 RAID patches and the
corresponding raidtools-0.90.19990824. The system also has DevFS
patched in.

We use a RAID5 configuration: each SCSI disk has one big partition
for RAID, giving us 4 RAID partitions: 3 of them form the RAID5
array, and one is configured as a spare.

Here is the /etc/raidtab:
[...]
raiddev         /dev/md/0
raid-level      5
nr-raid-disks   3
nr-spare-disks  1
chunk-size      32
persistent-superblock   1
parity-algorithm        left-symmetric
device                  /dev/scsi/host0/bus0/target0/lun0/part4
raid-disk               0
device                  /dev/scsi/host0/bus0/target1/lun0/part4
raid-disk               1
device                  /dev/scsi/host0/bus0/target2/lun0/part4
raid-disk               2
device                  /dev/scsi/host0/bus0/target3/lun0/part4
spare-disk              0 
[...]
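(An illustration, not part of the original report.) The
`parity-algorithm left-symmetric` line above selects how parity rotates
across the active disks. A small Python sketch of that mapping, assuming
the left-symmetric layout as implemented in raid5.c, where parity moves
from the last disk downward and data chunks start on the disk right
after parity:

```python
def left_symmetric(stripe, n_disks):
    """Map a stripe number to (parity_disk, data_disks) for the RAID5
    left-symmetric layout.  Parity rotates from the last disk downward;
    data chunks start on the disk after the parity disk and wrap."""
    parity = (n_disks - 1) - (stripe % n_disks)
    data = [(parity + 1 + d) % n_disks for d in range(n_disks - 1)]
    return parity, data

# With 3 active disks (as in the raidtab), parity cycles 2, 1, 0, 2, ...
for s in range(4):
    print(s, left_symmetric(s, 3))
```

So for stripe 0 the parity chunk lives on disk 2 and the data chunks on
disks 0 and 1; one stripe later everything has rotated by one disk.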

Now, on Tuesday, Jun 19th, at 09:08:58, the trouble started:
the kernel reported errors for one of the active RAID disks:
[...]
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36664997, scsi0, channel 0, id 2, lun 0 Read (10) 00 00 3a 5b b4
00 00 08 00
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36665003, scsi0, channel 0, id 2, lun 0 Read (10) 00 00 2c 5c 24
00 00 08 00
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36665005, scsi0, channel 0, id 2, lun 0 Read (10) 00 00 2c 5c 3c
00 00 08 00
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36665013, scsi0, channel 0, id 2, lun 0 Read (10) 00 00 4e 5b d4
00 00 08 00
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36665014, scsi0, channel 0, id 2, lun 0 Write (10) 00 01 5e 5b 84
00 00 08 00
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36665017, scsi0, channel 0, id 2, lun 0 Write (10) 00 01 5e 5b 8c
00 00 08 00
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36665018, scsi0, channel 0, id 2, lun 0 Read (10) 00 01 5e 63 b4
00 00 08 00
Jun 19 09:08:58 wien kernel: scsi : aborting command due to timeout :
pid 36665020, scsi0, channel 0, id 2, lun 0 Write (10) 00 01 fa 5b 84
00 00 08 00
Jun 19 09:08:59 wien kernel: SCSI host 0 abort (pid 36664997) timed
out - resetting
Jun 19 09:08:59 wien kernel: SCSI bus is being reset for host 0
channel 0.
Jun 19 09:09:02 wien kernel: (scsi0:0:0:0) Synchronous at 40.0
Mbyte/sec, offset 63.
Jun 19 09:09:02 wien kernel: (scsi0:0:2:0) Synchronous at 40.0
Mbyte/sec, offset 63.
Jun 19 09:09:02 wien kernel: SCSI disk error : host 0 channel 0 id 2
lun 0 return code = 28000002
Jun 19 09:09:02 wien kernel: [valid=0] Info fld=0x0, Current sd08:24:
sense key Not Ready
Jun 19 09:09:02 wien kernel: Additional sense indicates Logical unit
is in process of becoming ready
Jun 19 09:09:02 wien kernel: scsidisk I/O error: dev 08:24, sector
3014736 
[...]

The RAID subsystem noticed the problem, disabled the device,
and started re-syncing the spare disk (so far, so good!)
[...]
Jun 19 09:09:02 wien kernel: raid5: Disk failure on
scsi/host0/bus0/target2/lun0/part4, disabling device. Operation
continuing on 2 devices
Jun 19 09:09:02 wien kernel: md: recovery thread got woken up ...
Jun 19 09:09:02 wien kernel: md0: resyncing spare disk
scsi/host0/bus0/target3/lun0/part4 to replace failed disk
Jun 19 09:09:02 wien kernel: RAID5 conf printout:
Jun 19 09:09:02 wien kernel:  --- rd:3 wd:2 fd:1
Jun 19 09:09:02 wien kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:scsi/host0/bus0/target0/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:scsi/host0/bus0/target1/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1
dev:scsi/host0/bus0/target2/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 3, s:1, o:0, n:3 rd:3 us:1
dev:scsi/host0/bus0/target3/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 5, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 6, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:02 wien kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:02 wien kernel: RAID5 conf printout:
Jun 19 09:09:02 wien kernel:  --- rd:3 wd:2 fd:1
Jun 19 09:09:02 wien kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:scsi/host0/bus0/target0/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:scsi/host0/bus0/target1/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1
dev:scsi/host0/bus0/target2/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 3, s:1, o:1, n:3 rd:3 us:1
dev:scsi/host0/bus0/target3/lun0/part4
Jun 19 09:09:02 wien kernel:  disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 5, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 6, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:02 wien kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:02 wien kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:02 wien kernel: md: syncing RAID array md0
Jun 19 09:09:02 wien kernel: md: minimum _guaranteed_ reconstruction
speed: 100 KB/sec.
Jun 19 09:09:02 wien kernel: md: using maximum available idle IO
bandwith for reconstruction.
Jun 19 09:09:02 wien kernel: md: using 256k window.
[...]

In the same second, the kernel reported another error for the
already-disabled disk:
[...]
Jun 19 09:09:02 wien kernel: SCSI disk error : host 0 channel 0 id 2
lun 0 return code = 28000002
Jun 19 09:09:02 wien kernel: [valid=0] Info fld=0x0, Current sd08:24:
sense key Not Ready
Jun 19 09:09:02 wien kernel: Additional sense indicates Logical unit
is in process of becoming ready
Jun 19 09:09:02 wien kernel: scsidisk I/O error: dev 08:24, sector
31064064
[...]

And that was it: the RAID code apparently got confused and hit
the MD_BUG:
[...]
Jun 19 09:09:02 wien kernel: interrupting MD-thread pid 6
Jun 19 09:09:02 wien kernel: md: bug in file raid5.c, line 659
Jun 19 09:09:02 wien kernel:
Jun 19 09:09:02 wien kernel:        **********************************
Jun 19 09:09:02 wien kernel:        * <COMPLETE RAID STATE PRINTOUT> *
Jun 19 09:09:02 wien kernel:        **********************************
Jun 19 09:09:02 wien kernel: md0:
<scsi/host0/bus0/target3/lun0/part4><scsi/host0/bus0/target2/lun0/part4><scsi/host0/bus0/target1/lun0/part4><scsi/host0/bus0/target0/lun0/part4>
array superblock:
Jun 19 09:09:02 wien kernel:   SB: (V:0.90.0)
ID:<cea5d527.00000000.00000000.00000000> CT:39685ede
Jun 19 09:09:02 wien kernel:      L5 S16860096 ND:4 RD:3 md0 LO:2
CS:32768
Jun 19 09:09:02 wien kernel:      UT:3b2efa8e ST:0 AD:2 WD:3 FD:1 SD:1
CSUM:ed52e5e2 E:00000024
Jun 19 09:09:02 wien kernel:      D  0: 
DISK<N:0,scsi/host0/bus0/target0/lun0/part4(8,4),R:0,S:6>
Jun 19 09:09:02 wien kernel:      D  1: 
DISK<N:1,scsi/host0/bus0/target1/lun0/part4(8,20),R:1,S:6>
Jun 19 09:09:02 wien kernel:      D  2: 
DISK<N:2,scsi/host0/bus0/target2/lun0/part4(8,36),R:2,S:1>
Jun 19 09:09:02 wien kernel:      D  3: 
DISK<N:3,scsi/host0/bus0/target3/lun0/part4(8,52),R:3,S:0>
Jun 19 09:09:02 wien kernel:      D  4:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  5:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  6:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  7:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  8:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  9:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 10:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 11:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      THIS: 
DISK<N:3,scsi/host0/bus0/target3/lun0/part4(8,52),R:3,S:0>
Jun 19 09:09:02 wien kernel:  rdev scsi/host0/bus0/target3/lun0/part4:
O:scsi/host0/bus0/target3/lun0/part4, SZ:16860096 F:0 DN:3 rdev
superblock:
Jun 19 09:09:02 wien kernel:   SB: (V:0.90.0)
ID:<cea5d527.00000000.00000000.00000000> CT:39685ede
Jun 19 09:09:02 wien kernel:      L5 S16860096 ND:4 RD:3 md0 LO:2
CS:32768
Jun 19 09:09:02 wien kernel:      UT:3b2efa8e ST:0 AD:2 WD:3,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 10:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 11:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      THIS: 
DISK<N:1,scsi/host0/bus0/target1/lun0/part4(8,20),R:1,S:6>
Jun 19 09:09:02 wien kernel:  rdev scsi/host0/bus0/target0/lun0/part4:
O:scsi/host0/bus0/target0/lun0/part4, SZ:16860096 F:0 DN:0 rdev
superblock:
Jun 19 09:09:02 wien kernel:   SB: (V:0.90.0)
ID:<cea5d527.00000000.00000000.00000000> CT:39685ede
Jun 19 09:09:02 wien kernel:      L5 S16860096 ND:4 RD:3 md0 LO:2
CS:32768
Jun 19 09:09:02 wien kernel:      UT:3b2efa8e ST:0 AD:2 WD:3 dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 10:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 11:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      THIS: 
DISK<N:1,scsi/host0/bus0/target1/lun0/part4(8,20),R:1,S:6>
Jun 19 09:09:02 wien kernel:  rdev scsi/host0/bus0/target0/lun0/part4:
O:scsi/host0/bus0/target0/lun0/part4, SZ:16860096 F:0 DN:0 rdev
superblock:
Jun 19 09:09:02 wien kernel:   SB: (V:0.90.0)
ID:<cea5d527.00000000.00000000.00000000> CT:39685ede
Jun 19 09:09:02 wien kernel:      L5 S16860096 ND:4 RD:3 md0 LO:2
CS:32768
Jun 19 09:09:02 wien kernel:      UT:3b2efa8e ST:0 AD:2 WD:3 FD:1 SD:1
CSUM:ed6a42cd E:00000024
Jun 19 09:09:02 wien kernel:      D  0: 
DISK<N:0,scsi/host0/bus0/target0/lun0/part4(8,4),R:0,S:6>
Jun 19 09:09:02 wien kernel:      D  1: 
DISK<N:1,scsi/host0/bus0/target1/lun0/part4(8,20),R:1,S:6>
Jun 19 09:09:02 wien kernel:      D  2: 
DISK<N:2,scsi/host0/bus0/target2/lun0/part4(8,36),R:2,S:1>
Jun 19 09:09:02 wien kernel:      D  3: 
DISK<N:3,scsi/host0/bus0/target3/lun0/part4(8,52),R:3,S:0>
Jun 19 09:09:02 wien kernel:      D  4:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  5:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  6:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  7:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  8:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D  9:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 10:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      D 11:  DISK<N:0,[dev
00:00](0,0),R:0,S:4>
Jun 19 09:09:02 wien kernel:      THIS: 
DISK<N:0,scsi/host0/bus0/target0/lun0/part4(8,4),R:0,S:6>
Jun 19 09:09:02 wien kernel:        **********************************
Jun 19 09:09:02 wien kernel:
[...]

Over the next few seconds there were quite a few more of those
SCSI errors, each followed by a RAID STATE printout:
[...]
Jun 19 09:09:02 wien kernel: SCSI disk error : host 0 channel 0 id 2
lun 0 return code = 28000002
Jun 19 09:09:02 wien kernel: [valid=0] Info fld=0x0, Current sd08:24:
sense key Not Ready
Jun 19 09:09:02 wien kernel: Additional sense indicates Logical unit
is in process of becoming ready
Jun 19 09:09:02 wien kernel: scsidisk I/O error: dev 08:24, sector
3014744
Jun 19 09:09:02 wien kernel: interrupting MD-thread pid 6
Jun 19 09:09:02 wien kernel: md: bug in file raid5.c, line 659 
Jun 19 09:09:02 wien kernel:
Jun 19 09:09:02 wien kernel:        **********************************
Jun 19 09:09:02 wien kernel:        * <COMPLETE RAID STATE PRINTOUT> *
Jun 19 09:09:02 wien kernel:        **********************************
[...]
[complete printout omitted]
[...]
Jun 19 09:09:02 wien kernel: SCSI disk error : host 0 channel 0 id 2
lun 0 return code = 28000002
Jun 19 09:09:02 wien kernel: [valid=0] Info fld=0x0, Current sd08:24:
sense key Not Ready
Jun 19 09:09:02 wien kernel: Additional sense indicates Logical unit
is in process of becoming ready
Jun 19 09:09:02 wien kernel: scsidisk I/O error: dev 08:24, sector
20840456
Jun 19 09:09:02 wien kernel: interrupting MD-thread pid 6
Jun 19 09:09:02 wien kernel: md: bug in file raid5.c, line 659
Jun 19 09:09:02 wien kernel:
Jun 19 09:09:02 wien kernel:        **********************************
Jun 19 09:09:02 wien kernel:        * <COMPLETE RAID STATE PRINTOUT> *
Jun 19 09:09:02 wien kernel:        **********************************
[...]
[complete printout omitted]
[...]
Jun 19 09:09:02 wien kernel: SCSI disk error : host 0 channel 0 id 2
lun 0 return code = 28000002
Jun 19 09:09:02 wien kernel: [valid=0] Info fld=0x0, Current sd08:24:
sense key Not Ready
Jun 19 09:09:02 wien kernel: Additional sense indicates Logical unit
is in process of becoming ready
Jun 19 09:09:02 wien kernel: scsidisk I/O error: dev 08:24, sector
14548992
Jun 19 09:09:02 wien kernel: interrupting MD-thread pid 6
Jun 19 09:09:02 wien kernel: md: bug in file raid5.c, line 659
Jun 19 09:09:02 wien kernel:
Jun 19 09:09:02 wien kernel:        **********************************
Jun 19 09:09:02 wien kernel:        * <COMPLETE RAID STATE PRINTOUT> *
Jun 19 09:09:02 wien kernel:        **********************************
[...]
[complete printout omitted]
[...]
Jun 19 09:09:02 wien kernel: SCSI disk error : host 0 channel 0 id 2
lun 0 return code = 28000002
Jun 19 09:09:02 wien kernel: [valid=0] Info fld=0x0, Current sd08:24:
sense key Not Ready
Jun 19 09:09:02 wien kernel: Additional sense indicates Logical unit
is in process of becoming ready
Jun 19 09:09:02 wien kernel: scsidisk I/O error: dev 08:24, sector
19529728
Jun 19 09:09:02 wien kernel: interrupting MD-thread pid 6
Jun 19 09:09:02 wien kernel: md: bug in file raid5.c, line 659
Jun 19 09:09:02 wien kernel:
Jun 19 09:09:02 wien kernel:        **********************************
Jun 19 09:09:02 wien kernel:        * <COMPLETE RAID STATE PRINTOUT> *
Jun 19 09:09:02 wien kernel:        **********************************
[...]
[complete printout omitted]
[...]
There were about 25 SCSI error messages in the following few
seconds, all from the same device (SCSI ID 2).
Then, finally, it got _really_ bad:
[...]
Jun 19 09:09:09 wien kernel: raid5: restarting stripe 20840448
Jun 19 09:09:09 wien kernel: raid5: restarting stripe 22413312
Jun 19 09:09:09 wien kernel: raid5: restarting stripe 21889024
Jun 19 09:09:09 wien kernel: raid5: restarting stripe 14548992
Jun 19 09:09:09 wien kernel: raid5: restarting stripe 31064064
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 19529728
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20840456
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 21889032
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 14549000
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 31064072
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 29491208
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20578336
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 29491232
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 14549024
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 31064096
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 12320808
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 21626920
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 22413352
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 29491240
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 23068720
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 1703984
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20840496
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 21889072
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 14549040
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 31064112
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 22413360
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20840520
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 3014736
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20840528
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 3014744
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20840536
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 3014768
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 3014776
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20578448
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 786592
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 786616
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 4325584
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 91480
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 20842544
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 21891128
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 22939712
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 9832776
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 19532968
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 19532992
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 14552272
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 31067352
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 19533024
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 22416608
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 29494512
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 31315928
Jun 19 09:09:10 wien kernel: raid5: restarting stripe 12336080
Jun 19 09:09:10 wien kernel: md_do_sync() got signal ... exiting
Jun 19 09:09:10 wien kernel: RAID5 conf printout:
Jun 19 09:09:10 wien kernel:  --- rd:3 wd:2 fd:1
Jun 19 09:09:10 wien kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:scsi/host0/bus0/target0/lun0/part4
Jun 19 09:09:10 wien kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:scsi/host0/bus0/target1/lun0/part4
Jun 19 09:09:10 wien kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1
dev:scsi/host0/bus0/target2/lun0/part4
Jun 19 09:09:10 wien kernel:  disk 3, s:1, o:1, n:3 rd:3 us:1
dev:scsi/host0/bus0/target3/lun0/part4
Jun 19 09:09:10 wien kernel:  disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:10 wien kernel:  disk 5, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:10 wien kernel:  disk 6, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:10 wien kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:10 wien kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:10 wien kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:10 wien kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:10 wien kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:10 wien kernel: RAID5 conf printout:
Jun 19 09:09:10 wien kernel:  --- rd:3 wd:2 fd:1
Jun 19 09:09:10 wien kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:scsi/host0/bus0/target0/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:scsi/host0/bus0/target1/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1
dev:scsi/host0/bus0/target2/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 3, s:1, o:0, n:3 rd:3 us:1
dev:scsi/host0/bus0/target3/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 5, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 6, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:11 wien kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:11 wien kernel: md: recovery thread finished ...
Jun 19 09:09:11 wien kernel: md: recovery thread got woken up ...
Jun 19 09:09:11 wien kernel: md0: resyncing spare disk
scsi/host0/bus0/target3/lun0/part4 to replace failed disk
Jun 19 09:09:11 wien kernel: RAID5 conf printout:
Jun 19 09:09:11 wien kernel:  --- rd:3 wd:2 fd:1
Jun 19 09:09:11 wien kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:scsi/host0/bus0/target0/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:scsi/host0/bus0/target1/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1
dev:scsi/host0/bus0/target2/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 3, s:1, o:0, n:3 rd:3 us:1
dev:scsi/host0/bus0/target3/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 5, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 6, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:11 wien kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:11 wien kernel: RAID5 conf printout:
Jun 19 09:09:11 wien kernel:  --- rd:3 wd:2 fd:1
Jun 19 09:09:11 wien kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1
dev:scsi/host0/bus0/target0/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1
dev:scsi/host0/bus0/target1/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1
dev:scsi/host0/bus0/target2/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 3, s:1, o:1, n:3 rd:3 us:1
dev:scsi/host0/bus0/target3/lun0/part4
Jun 19 09:09:11 wien kernel:  disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 5, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 6, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:11 wien kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0 dev:[dev
00:00]
Jun 19 09:09:12 wien kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:12 wien kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0
dev:[dev 00:00]
Jun 19 09:09:12 wien kernel: md: syncing RAID array md0
Jun 19 09:09:12 wien kernel: md: minimum _guaranteed_ reconstruction
speed: 100 KB/sec.
Jun 19 09:09:12 wien kernel: md: using maximum available idle IO
bandwith for reconstruction.
Jun 19 09:09:12 wien kernel: md: using 256k window.

[I can't say what these floppy-driver messages mean here --]
[coincidence or real trouble? I don't know whether someone tried to]
[access the floppy in that situation...]

Jun 19 09:09:12 wien kernel: inserting floppy driver for 2.2.17
Jun 19 09:09:12 wien kernel: Floppy drive(s): fd0 is 1.44M
Jun 19 09:09:12 wien kernel: FDC 0 is a post-1991 82077
Jun 19 09:09:12 wien kernel: VFS: Disk change detected on device
fd(2,0)
Jun 19 09:09:12 wien kernel: end_request: I/O error, dev 02:00
(floppy), sector 0

Jun 19 09:10:07 wien kernel: attempt to access beyond end of device
Jun 19 09:10:07 wien kernel: 08:04: rw=0, want=1788379804,
limit=16860217
Jun 19 09:10:07 wien kernel: dev 09:00 blksize=1024 blocknr=447094950
sector=-718207696 size=4096 count=1
Jun 19 09:10:07 wien kernel: interrupting MD-thread pid 6
Jun 19 09:10:07 wien kernel: raid5: Disk failure on
scsi/host0/bus0/target0/lun0/part4, disabling device. Operation
continuing on 1 devices
Jun 19 09:10:07 wien kernel: attempt to access beyond end of device
Jun 19 09:10:07 wien kernel: 08:14: rw=0, want=1788379804,
limit=16860217
Jun 19 09:10:07 wien kernel: dev 09:00 blksize=1024 blocknr=447094950
sector=-718207696 size=4096 count=1
Jun 19 09:10:07 wien kernel: interrupting MD-thread pid 6
Jun 19 09:10:07 wien kernel: raid5: Disk failure on
scsi/host0/bus0/target1/lun0/part4, disabling device. Operation
continuing on 0 devices
Jun 19 09:10:07 wien kernel: raid5: restarting stripe 3576759600
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1431060822
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1734691872
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1330205776
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1328495717
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1769432423
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 168640078
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1414549308
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 542003017
Jun 19 09:10:07 wien kernel: md: updating md0 RAID superblock on
device
Jun 19 09:10:07 wien kernel: scsi/host0/bus0/target3/lun0/part4
[events: 00000025](write) scsi/host0/bus0/target3/lun0/part4's sb
offset: 16860096
Jun 19 09:10:07 wien kernel: (skipping faulty
scsi/host0/bus0/target2/lun0/part4 )
Jun 19 09:10:07 wien kernel: (skipping faulty
scsi/host0/bus0/target1/lun0/part4 )
Jun 19 09:10:07 wien kernel: (skipping faulty
scsi/host0/bus0/target0/lun0/part4 )
Jun 19 09:10:07 wien kernel: .
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 47112
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 47104
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 47113
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 47105 
[...]
[huge number of I/O error messages omitted]
[...]
Jun 19 09:10:07 wien kernel: EXT2-fs error (device md(9,0)):
ext2_readdir: directory #99426 contains a hole at offset 0
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1769432423
Jun 19 09:10:07 wien kernel: EXT2-fs error (device md(9,0)):
ext2_readdir: directory #99426 contains a hole at offset 4096
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1328495717
Jun 19 09:10:07 wien kernel: EXT2-fs error (device md(9,0)):
ext2_readdir: directory #99426 contains a hole at offset 8192
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1330205776
Jun 19 09:10:07 wien kernel: EXT2-fs error (device md(9,0)):
ext2_readdir: directory #99426 contains a hole at offset 12288
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 168640078
Jun 19 09:10:07 wien kernel: EXT2-fs error (device md(9,0)):
ext2_readdir: directory #99426 contains a hole at offset 16384
Jun 19 09:10:07 wien kernel: raid5: md0: unrecoverable I/O error for
block 1414549308
Jun 19 09:10:07 wien kernel: EXT2-fs error (device md(9,0)):
ext2_readdir: directory #99426 contains a hole at offset 20480
[...]
[huge number of EXT2-fs error messages omitted]
[...]

Those error messages went on for the next six minutes
(producing thousands of log messages) until the admin
at our customer's site did an emergency reboot at 09:18:23.

The system booted, and the RAID came up like this:
[...]
Jun 19 09:18:23 wien kernel: (read)
scsi/host0/bus0/target0/lun0/part4's sb offset: 16860096 [events:
00000024]
Jun 19 09:18:23 wien kernel: (read)
scsi/host0/bus0/target1/lun0/part4's sb offset: 16860096 [events:
00000024]
Jun 19 09:18:23 wien kernel: (read)
scsi/host0/bus0/target2/lun0/part4's sb offset: 16860096 [events:
00000023]
Jun 19 09:18:23 wien kernel: (read)
scsi/host0/bus0/target3/lun0/part4's sb offset: 16860096 [events:
00000028]
Jun 19 09:18:23 wien kernel: autorun ...
Jun 19 09:18:23 wien kernel: considering
scsi/host0/bus0/target3/lun0/part4 ...
Jun 19 09:18:23 wien kernel:   adding
scsi/host0/bus0/target3/lun0/part4 ...
Jun 19 09:18:23 wien kernel:   adding
scsi/host0/bus0/target2/lun0/part4 ...
Jun 19 09:18:23 wien kernel:   adding
scsi/host0/bus0/target1/lun0/part4 ...
Jun 19 09:18:23 wien kernel:   adding
scsi/host0/bus0/target0/lun0/part4 ...
Jun 19 09:18:23 wien kernel: created md0
Jun 19 09:18:23 wien kernel:
bind<scsi/host0/bus0/target0/lun0/part4,1>
Jun 19 09:18:23 wien kernel:
bind<scsi/host0/bus0/target1/lun0/part4,2>
Jun 19 09:18:23 wien kernel:
bind<scsi/host0/bus0/target2/lun0/part4,3>
Jun 19 09:18:23 wien kernel:
bind<scsi/host0/bus0/target3/lun0/part4,4>
Jun 19 09:18:23 wien kernel: running:
<scsi/host0/bus0/target3/lun0/part4><scsi/host0/bus0/target2/lun0/part4><scsi/host0/bus0/target1/lun0/part4><scsi/host0/bus0/target0/lun0/part4>
Jun 19 09:18:23 wien kernel: now!
Jun 19 09:18:23 wien kernel: scsi/host0/bus0/target3/lun0/part4's event counter: 00000028
Jun 19 09:18:23 wien kernel: scsi/host0/bus0/target2/lun0/part4's event counter: 00000023
Jun 19 09:18:23 wien kernel: scsi/host0/bus0/target1/lun0/part4's event counter: 00000024
Jun 19 09:18:23 wien kernel: scsi/host0/bus0/target0/lun0/part4's event counter: 00000024
Jun 19 09:18:23 wien kernel: md: superblock update time inconsistency -- using the most recent one
Jun 19 09:18:23 wien kernel: freshest: scsi/host0/bus0/target3/lun0/part4
Jun 19 09:18:23 wien kernel: md: kicking non-fresh scsi/host0/bus0/target2/lun0/part4 from array!
Jun 19 09:18:23 wien kernel: unbind<scsi/host0/bus0/target2/lun0/part4,3>
Jun 19 09:18:23 wien kernel: export_rdev(scsi/host0/bus0/target2/lun0/part4)
Jun 19 09:18:23 wien kernel: md: kicking non-fresh scsi/host0/bus0/target1/lun0/part4 from array!
Jun 19 09:18:23 wien kernel: unbind<scsi/host0/bus0/target1/lun0/part4,2>
Jun 19 09:18:23 wien kernel: export_rdev(scsi/host0/bus0/target1/lun0/part4)
Jun 19 09:18:23 wien kernel: md: kicking non-fresh scsi/host0/bus0/target0/lun0/part4 from array!
Jun 19 09:18:23 wien kernel: unbind<scsi/host0/bus0/target0/lun0/part4,1>
Jun 19 09:18:23 wien kernel: export_rdev(scsi/host0/bus0/target0/lun0/part4)
Jun 19 09:18:23 wien kernel: md0: removing former faulty scsi/host0/bus0/target0/lun0/part4!
Jun 19 09:18:23 wien kernel: md0: removing former faulty scsi/host0/bus0/target1/lun0/part4!
Jun 19 09:18:23 wien kernel: md0: removing former faulty scsi/host0/bus0/target2/lun0/part4!
Jun 19 09:18:23 wien kernel: md0: kicking faulty scsi/host0/bus0/target3/lun0/part4!
Jun 19 09:18:23 wien kernel: unbind<scsi/host0/bus0/target3/lun0/part4,0>
Jun 19 09:18:23 wien kernel: export_rdev(scsi/host0/bus0/target3/lun0/part4)
Jun 19 09:18:23 wien kernel: raid5 personality registered
Jun 19 09:18:23 wien kernel: md0: max total readahead window set to 256k
Jun 19 09:18:23 wien kernel: md0: 2 data-disks, max readahead per data-disk: 128k
Jun 19 09:18:23 wien kernel: raid5: not enough operational devices for md0 (3/3 failed)
Jun 19 09:18:23 wien kernel: RAID5 conf printout:
Jun 19 09:18:23 wien kernel:  --- rd:3 wd:0 fd:3
Jun 19 09:18:23 wien kernel:  disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 1, s:0, o:0, n:1 rd:1 us:1 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 3, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 4, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 5, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 6, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 7, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 8, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 9, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 10, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel:  disk 11, s:0, o:0, n:0 rd:0 us:0 dev:[dev 00:00]
Jun 19 09:18:23 wien kernel: raid5: failed to run raid set md0
Jun 19 09:18:23 wien kernel: pers->run() failed ...
Jun 19 09:18:23 wien kernel: do_md_run() returned -22
Jun 19 09:18:23 wien kernel: ... autorun DONE.
[...]

Some more reboots gave exactly the same result; the RAID didn't come
up again. After an emergency call from them I logged into the machine
and found the RAID basically _dead_. I tried to manually restart the
array, but it didn't come up. After examining the logfiles, I decided
to re-initialize the whole array, which worked fine. A complete restore
of all the data from the backup _would_ also have worked fine, if they
had followed our repeated advice to get a bigger backup drive: the
backup tape contained all but about 6GB of the data, so now quite a
few important files (all the e-mail for all employees there!) are lost!
A lesson to learn, I think!

Now, apart from the backup problem (which isn't the fault of
the RAID subsystem), I'm asking myself if there is a problem
with the RAID code in Linux: only one disk failed, and only
temporarily (that disk has worked without problems for the past
few days, but is of course to be replaced ASAP), we had a RAID5
with a spare disk, and despite all that we lost the whole
array! Okay, one could say it was really bad luck, but shouldn't
the RAID code simply ignore error messages from an already disabled
disk? Why did the RAID code disable all the other RAID devices
(starting around 09:10:07 in the logs), even though there was no error
message from the SCSI driver for those other drives at that point?
Is there any way to recover from that situation (apart from
re-initializing the whole array), given the fact that all
4 disks were later working again without problems?
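To illustrate why the array ended up with 3/3 disks failed, here is a
small sketch (in Python, not the actual kernel code; all names are
illustrative) of the "freshness" check the 0.90 md code applies at
assembly time, fed with the event counters from the log above: md picks
the superblock with the highest event counter as the freshest and kicks
every member whose counter lags behind it.

```python
# Hypothetical sketch of the v0.90 md assembly-time freshness check.
# This is NOT the real kernel code, just the decision logic as the
# log suggests it: keep only members matching the freshest counter.

def assemble(members):
    """members: dict of device name -> superblock event counter.
    Returns (kept, kicked): devices matching the freshest (highest)
    event counter, and devices kicked as "non-fresh"."""
    freshest = max(members.values())
    kept = [dev for dev, ev in members.items() if ev == freshest]
    kicked = [dev for dev, ev in members.items() if ev < freshest]
    return kept, kicked

# Event counters from the log (hexadecimal):
members = {
    "target3/part4": 0x28,
    "target2/part4": 0x23,
    "target1/part4": 0x24,
    "target0/part4": 0x24,
}
kept, kicked = assemble(members)
# Only target3 (the lone survivor of the original incident) is "fresh";
# the other three members are kicked, so the 3-disk RAID5 cannot start.
```

If this reading is right, the interesting question is why the surviving
disk's counter kept advancing after the others had already been marked
faulty, leaving no quorum of fresh members at the next boot.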

In the past few days I did several tests with RAID5 arrays on
an Adaptec 29160 controller, with and without spare disks, even
just pulling a hot-plug SCA disk from a running system.
They all went fine: the spare disk got re-synced into the array
without problems, and there wasn't a single failure of the array
in about 5 or 6 different test cases.

So, what went wrong with our customer's system? Any idea, anyone?
If anyone is interested, I can also send the complete logfile
from that event (in case I omitted any important information).

We have enough test hardware here (even with SCA hot-plug disks),
so we can do some testing if that would help. (Of course we cannot
run tests on the machine that failed in the first place, as
it is up and running in production again.)

Regards,

- andreas

-- 
Andreas Haumer                     | mailto:[EMAIL PROTECTED]
*x Software + Systeme              | http://www.xss.co.at/
Karmarschgasse 51/2/20             | Tel: +43-1-6060114-0
A-1100 Vienna, Austria             | Fax: +43-1-6060114-71
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
