raid5 - which disk failed ?
Hi,

I'm using a RAID 5 with 4*400 GB PATA disks on a rather old VIA mainboard, running CentOS 5.0. A few days ago the server started to reboot or freeze occasionally; after a reboot, md always starts a resync of the RAID:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

unused devices: <none>

After about an hour, the server freezes again. I figured out that around this time the following errors are reported in the messages log:

Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
Sep 23 22:23:14 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:14 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106023, high=15, low=2447783, sector=254106023
Sep 23 22:23:14 alfred kernel: end_request: I/O error, dev hde, sector 254106023
Sep 23 22:23:18 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:18 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106031, high=15, low=2447791, sector=254106031
Sep 23 22:23:18 alfred kernel: end_request: I/O error, dev hde, sector 254106031
Sep 23 22:23:23 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:23 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106039, high=15, low=2447799, sector=254106039
Sep 23 22:23:23 alfred kernel: end_request: I/O error, dev hde, sector 254106039
Sep 23 22:23:43 alfred kernel: hde: dma_timer_expiry: dma status == 0x21
Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 23 22:28:40 alfred kernel: ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio

Now there are two things that puzzle me:

1) When md starts a resync of the array, shouldn't one drive be marked as down [_UUU] in mdstat instead of being reported as [UUUU]? Or, the other way round: is hde really the faulty drive? How can I make sure I'm removing and replacing the proper drive?

2) Can a faulty drive in a RAID 5 really crash the whole server? Maybe the bug in the onboard Promise controller adds to this problem (see attachment for dmesg output).

tia.

[Attachment: dmesg]
Re: raid5 - which disk failed ?
Rainer Fuegenstein wrote:
> 1) When md starts a resync of the array, shouldn't one drive be marked
> as down [_UUU] in mdstat instead of being reported as [UUUU]? Or, the
> other way round: is hde really the faulty drive? How can I make sure
> I'm removing and replacing the proper drive?

If it is not already, install smartmontools. It certainly looks like hde is failing, so a smartctl -a /dev/hde should give you some idea. You will find it also gives you the serial number of the drive, which will be attached to a label on the drive, allowing you to locate it.

Regards,
Richard
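For example, the check could look like this (a minimal sketch, assuming a CentOS 5 box with yum available and hde as the suspect drive):

# install smartmontools if it is not already present
$ yum install smartmontools

# full SMART report: health status, attributes and the drive's error log
$ smartctl -a /dev/hde

# identity block only, including the model and serial number to match
# against the label on the physical drive
$ smartctl -i /dev/hde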
Re: raid5 - which disk failed ?
On Monday September 24, [EMAIL PROTECTED] wrote:
> Hi, I'm using a RAID 5 with 4*400 GB PATA disks on a rather old VIA
> mainboard, running CentOS 5.0. A few days ago the server started to
> reboot or freeze occasionally; after a reboot, md always starts a
> resync of the RAID:
>
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
>       1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>       [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

This is normal. If there was any write activity in the few hundred milliseconds before a crash, you need to resync because the parity of the stripes being written could be incorrect.

> After about an hour, the server freezes again. I figured out that
> around this time the following errors are reported in the messages log:
>
> Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
> Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> [...]
> Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
> Sep 23 22:28:40 alfred kernel: ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio

Something definitely sick there.

> Now there are two things that puzzle me:
>
> 1) When md starts a resync of the array, shouldn't one drive be marked
> as down [_UUU] in mdstat instead of being reported as [UUUU]? Or, the
> other way round: is hde really the faulty drive? How can I make sure
> I'm removing and replacing the proper drive?

When a drive fails, md records that failure in the metadata on the other devices in the array. The fact that the drive is not marked as failed after the reboot suggests that md failed to update the metadata of the good drives. Maybe it is the controller that is failing rather than a drive, and it cannot write to anything at this point. Or maybe the drive is failing, but that is badly confusing the controller, with the same result. Is it always hde that is reporting errors?

With PATA, it is fairly easy to make sure you have removed the correct drive, as names don't change: hde is the 'master' on the 3rd channel, presumably the first channel of your controller card.
Just disconnect the drive you think it is, reboot, and see if hde is still there.

> 2) Can a faulty drive in a RAID 5 really crash the whole server? Maybe
> the bug in the onboard Promise controller adds to this problem (see
> attachment for dmesg output).

No, a faulty drive in a raid5 should not crash the whole server. But a bad controller card or a buggy driver for the controller could.

NeilBrown
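To make the replacement itself concrete, the swap could look something like this (a sketch only, assuming hde really is the bad drive and hde1 is its only md member):

# confirm the drive's identity before pulling it; the serial number
# printed here should match the label on the physical drive
$ hdparm -i /dev/hde | grep -i serial

# mark the member as failed, then remove it from the array
$ mdadm /dev/md0 --fail /dev/hde1
$ mdadm /dev/md0 --remove /dev/hde1

# power down, swap the drive, partition it like the others, then re-add:
$ mdadm /dev/md0 --add /dev/hde1

# watch the rebuild progress
$ cat /proc/mdstat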
Re: [PATCH] [mdadm] Add klibc support to mdadm.h
Thanks for this and the other 2 patches. They are all in the mdadm .git.

Thanks,
NeilBrown
Re: mdadm 2.6.3 segfaults on assembly (v1 superblocks)
On Friday September 7, [EMAIL PROTECTED] wrote:
> Neil, could this be a bug?

Sure could. Thanks for the report. This patch (already in .git) should fix it.

NeilBrown

---
Don't corrupt 'supertype' when speculatively calling load_super1

When load_super1 is trying to see which sub-version of the v1 superblock is present, failure will cause it to clear st->ss, which is not good. So use a temporary 'super_type' for the 'test if this version works' calls, then copy that into 'st' on success.

### Diffstat output
 ./super1.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff .prev/super1.c ./super1.c
--- .prev/super1.c	2007-09-24 14:26:19.000000000 +1000
+++ ./super1.c	2007-09-24 14:23:11.000000000 +1000
@@ -996,34 +996,35 @@ static int load_super1(struct supertype
 	if (st->ss == NULL || st->minor_version == -1) {
 		int bestvers = -1;
+		struct supertype tst;
 		__u64 bestctime = 0;
 		/* guess... choose latest ctime */
-		st->ss = &super1;
-		for (st->minor_version = 0; st->minor_version <= 2 ; st->minor_version++) {
+		tst.ss = &super1;
+		for (tst.minor_version = 0; tst.minor_version <= 2 ; tst.minor_version++) {
 			switch(load_super1(&tst, fd, sbp, devname)) {
 			case 0: super = *sbp;
 				if (bestvers == -1 ||
 				    bestctime < __le64_to_cpu(super->ctime)) {
-					bestvers = st->minor_version;
+					bestvers = tst.minor_version;
 					bestctime = __le64_to_cpu(super->ctime);
 				}
 				free(super);
 				*sbp = NULL;
 				break;
-			case 1: st->ss = NULL; return 1; /*bad device */
+			case 1: return 1; /*bad device */
 			case 2: break; /* bad, try next */
 			}
 		}
 		if (bestvers != -1) {
 			int rv;
-			st->minor_version = bestvers;
-			st->ss = &super1;
-			st->max_devs = 384;
+			tst.minor_version = bestvers;
+			tst.ss = &super1;
+			tst.max_devs = 384;
 			rv = load_super1(&tst, fd, sbp, devname);
-			if (rv) st->ss = NULL;
+			if (rv == 0)
+				*st = tst;
 			return rv;
 		}
-		st->ss = NULL;
 		return 2;
 	}
 	if (!get_dev_size(fd, devname, &dsize))
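For reference, this guessing path runs whenever mdadm has to work out the v1 sub-version on its own, e.g. when examining or assembling without an explicit metadata version (device names below are hypothetical):

# probe a member device; mdadm tries the v1.0/v1.1/v1.2 offsets in turn
$ mdadm --examine /dev/sda1

# assembly likewise probes each device's superblock
$ mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1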