Re[2]: raid5 - which disk failed ?
> Or maybe the drive is failing, but that is badly confusing the
> controller, with the same result.
>
> Is it always hde that is reporting errors?

for now - yes; but a few months ago, for a short period of time, hdg and hdh were also reported with errors. this went away quickly and never occurred again.

> With PATA, it is fairly easy to make sure you have removed the correct
> drive, and names don't change. hde is the 'master' on the 3rd
> channel. Presumably the first channel of your controller card.

I know; what I meant was: I'd like to make sure that I remove the one drive that md thinks is faulty - I want to avoid removing a healthy drive, leaving md with one broken drive and two healthy ones, which isn't good for a raid5. but in this case, hde is rather certainly the troublemaker.

> No, a faulty drive in a raid5 should not crash the whole server. But
> a bad controller card or buggy driver for the controller could.

this seems to be the case here. guess it's time to shop for a new server. tnx.

--
Rainer Fuegenstein
[EMAIL PROTECTED]

--
"Why are you looking into the darkness and not into the fire as we do?", Nell asked. "Because the darkness is where danger comes from", Peter said, "and from the fire comes only illusion." (from The Diamond Age)

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
raid5 - which disk failed ?
Hi,

I'm using a raid 5 with 4*400 GB PATA disks on a rather old VIA mainboard, running centos 5.0. a few days ago the server started to reboot or freeze occasionally; after reboot, md always starts a resync of the raid:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  resync =  0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

unused devices: <none>

after about an hour, the server freezes again. I figured out that about this time the following errors are reported in the messages log:

Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
Sep 23 22:23:09 alfred kernel: end_request: I/O error, dev hde, sector 254106015
Sep 23 22:23:14 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:14 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106023, high=15, low=2447783, sector=254106023
Sep 23 22:23:14 alfred kernel: end_request: I/O error, dev hde, sector 254106023
Sep 23 22:23:18 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:18 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106031, high=15, low=2447791, sector=254106031
Sep 23 22:23:18 alfred kernel: end_request: I/O error, dev hde, sector 254106031
Sep 23 22:23:23 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Sep 23 22:23:23 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106039, high=15, low=2447799, sector=254106039
Sep 23 22:23:23 alfred kernel: end_request: I/O error, dev hde, sector 254106039
Sep 23 22:23:43 alfred kernel: hde: dma_timer_expiry: dma status == 0x21
Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
Sep 23 22:28:40 alfred kernel: ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio

now there are two things that puzzle me:

1) when md starts a resync of the array, shouldn't one drive be marked as down [_UUU] in mdstat instead of reporting it as [UUUU]? or, the other way round: is hde really the faulty drive? how can I make sure I'm removing and replacing the proper drive?

2) can a faulty drive in a raid5 really crash the whole server? maybe it's the bug in the onboard promise controller that adds to this problem (see attachment for dmesg output).

tia.

[attachment: dmesg]
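For anyone reading the archive: the `[UUUU]`/`[_UUU]` status string the question refers to can be pulled out of /proc/mdstat mechanically. A minimal sketch, using an embedded sample in place of the live /proc/mdstat (the sample text is illustrative, not the poster's actual output):

```shell
#!/bin/sh
# Sketch: extract the per-member status string ([UUUU], [_UUU], ...)
# from mdstat output; an underscore marks a failed slot.
# A here-doc-style variable stands in for /proc/mdstat so the snippet
# is self-contained; on a real box, read the file instead.
mdstat_sample='md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
      1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]'

status=$(printf '%s\n' "$mdstat_sample" | grep -o '\[[U_]\{1,\}\]')
echo "member status: $status"

case $status in
  *_*) echo "array is degraded" ;;
  *)   echo "all members up" ;;
esac
```

On the poster's box the string shows all four members up, which is exactly what puzzled him: the resync runs even though no member is marked failed.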
Re: raid5 - which disk failed ?
Rainer Fuegenstein wrote:
> 1) when md starts a resync of the array, shouldn't one drive be marked
> as down [_UUU] in mdstat instead of reporting it as [UUUU]? or, the
> other way round: is hde really the faulty drive? how can I make sure
> I'm removing and replacing the proper drive?

If it is not already, install smartmontools. It certainly looks like hde is failing, so a "smartctl -a /dev/hde" should give you some idea. You will find it also gives you the serial number of the drive, which will be printed on a label on the drive, allowing you to locate it.

Regards,
Richard
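The serial-number trick Richard describes can be scripted. A sketch that greps the serial out of `smartctl -i`-style output; the sample text below is illustrative (the model and serial are made up, not from the poster's drive):

```shell
#!/bin/sh
# Sketch: pull the serial number out of smartctl identify output so a
# failing drive can be matched against the label on its case.
# On a real system you would run:  smartctl -i /dev/hde
# (smartmontools package); here a sample stands in for that output.
sample='Device Model:     SAMPLE-DRIVE-400GB
Serial Number:    S0MU1234567890
Firmware Version: 1.23'

serial=$(printf '%s\n' "$sample" | awk -F': *' '/^Serial Number/ {print $2}')
echo "serial: $serial"
```

Comparing that string against the printed label is the safe way to be sure the drive you unplug is the one the kernel calls hde.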
Re: raid5 - which disk failed ?
On Monday September 24, [EMAIL PROTECTED] wrote:
> Hi,
>
> I'm using a raid 5 with 4*400 GB PATA disks on a rather old VIA
> mainboard, running centos 5.0. a few days ago the server started to
> reboot or freeze occasionally; after reboot, md always starts a resync
> of the raid:
>
> $ cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 hdh1[3] hdg1[2] hdf1[1] hde1[0]
>       1172126208 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
>       resync = 0.9% (3819132/390708736) finish=366.2min speed=17603K/sec

This is normal. If there was any write activity in the few hundred milliseconds before a crash, you need to resync because the parity of the stripe being written could be incorrect.

> after about an hour, the server freezes again. I figured out that
> about this time the following errors are reported in the messages log:
>
> Sep 23 22:23:05 alfred kernel: end_request: I/O error, dev hde, sector 254106007
> Sep 23 22:23:09 alfred kernel: hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> Sep 23 22:23:09 alfred kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=254106015, high=15, low=2447775, sector=254106015
> [...]
> Sep 23 22:23:43 alfred kernel: hde: dma_timer_expiry: dma status == 0x21
> Sep 23 22:23:53 alfred kernel: hde: DMA timeout error
> Sep 23 22:23:53 alfred kernel: hde: dma timeout error: status=0x58 { DriveReady SeekComplete DataRequest }
> Sep 23 22:28:40 alfred kernel: ide2: BM-DMA at 0x7800-0x7807, BIOS settings: hde:DMA, hdf:pio

Something definitely sick there.

> now there are two things that puzzle me:
>
> 1) when md starts a resync of the array, shouldn't one drive be marked
> as down [_UUU] in mdstat instead of reporting it as [UUUU]? or, the
> other way round: is hde really the faulty drive? how can I make sure
> I'm removing and replacing the proper drive?

When a drive fails, md records that failure in the metadata on the other devices in the array. The fact that the drive is not marked as failed after the reboot suggests that md failed to update the metadata of the good drives. Maybe it is the controller that is failing rather than a drive, and it cannot write to anything at this point. Or maybe the drive is failing, but that is badly confusing the controller, with the same result.

Is it always hde that is reporting errors?

With PATA, it is fairly easy to make sure you have removed the correct drive, and names don't change. hde is the 'master' on the 3rd channel. Presumably the first channel of your controller card. Just disconnect the drive you think it is, reboot, and see if hde is still there.

> 2) can a faulty drive in a raid5 really crash the whole server? maybe
> it's the bug in the onboard promise controller that adds to this
> problem (see attachment for dmesg output).

No, a faulty drive in a raid5 should not crash the whole server. But a bad controller card or buggy driver for the controller could.
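Once the right physical drive is confirmed, the md side of the swap is short. A sketch of the usual fail/remove/add sequence, printed rather than executed so nothing is touched by accident (device names are the poster's; drop the echo wrapper only after the drive is confirmed, e.g. via its serial number):

```shell
#!/bin/sh
# Sketch of the md-side replacement sequence for a failed hde1.
# run() echoes the command instead of executing it, as a dry run.
run() { echo "would run: $*"; }

run mdadm /dev/md0 --fail   /dev/hde1   # mark the member as faulty
run mdadm /dev/md0 --remove /dev/hde1   # detach it from the array
# ...power down, swap the drive, partition it like the others, then:
run mdadm /dev/md0 --add    /dev/hde1   # start the rebuild
```

Watching /proc/mdstat afterwards shows the rebuild progress onto the new member.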
NeilBrown