Folks,

I've been running four 37GB IBM drives in a RAID5 array for several months now.
They're driven by two Promise Ultra66 IDE controllers, and I'm running
2.2.13ac2 with Andre Hedrick's ide.2.2.13.19991111.patch (see another thread -
I'm considering upgrading but am not sure of the current state of the art).

Meet my drives:

hde: IBM-DPTA-373420, 32634MB w/1961kB Cache, CHS=66305/16/63, UDMA(66)
hdg: IBM-DPTA-373420, 32634MB w/1961kB Cache, CHS=66305/16/63, UDMA(66)
hdi: IBM-DPTA-373420, 32634MB w/1961kB Cache, CHS=66305/16/63, UDMA(66)
hdk: IBM-DPTA-373420, 32634MB w/1961kB Cache, CHS=66305/16/63, UDMA(66)

Occasionally I get a DMA dropout on one drive, followed by write errors, and
yet the drive stays in the array:

Apr 10 19:35:10 osmin kernel: hdg: timeout waiting for DMA 
Apr 10 19:35:10 osmin kernel: hdg: irq timeout: status=0xd0 { Busy } 
Apr 10 19:35:10 osmin kernel: hdg: DMA disabled 
Apr 10 19:35:12 osmin kernel: ide3: reset: success 

[snip irrelevant logs]

Apr 10 20:00:50 osmin kernel: hdg: write_intr error2: nr_sectors=1, stat=0x58 
Apr 10 20:00:50 osmin kernel: hdg: write_intr: status=0x58 { DriveReady SeekComplete DataRequest } 
Apr 10 20:00:50 osmin kernel: ide3: reset: success 
Apr 10 20:01:05 osmin kernel: hdg: write_intr error2: nr_sectors=1, stat=0x58 
Apr 10 20:01:05 osmin kernel: hdg: write_intr: status=0x58 { DriveReady SeekComplete DataRequest } 
Apr 10 20:01:05 osmin kernel: ide3: reset: success 
Apr 10 20:01:18 osmin kernel: hdg: write_intr error2: nr_sectors=1, stat=0x58 
Apr 10 20:01:18 osmin kernel: hdg: write_intr: status=0x58 { DriveReady SeekComplete DataRequest } 
Apr 10 20:01:18 osmin kernel: ide3: reset: success 
Apr 10 20:01:23 osmin kernel: hdg: write_intr error2: nr_sectors=1, stat=0x58 
Apr 10 20:01:23 osmin kernel: hdg: write_intr: status=0x58 { DriveReady SeekComplete DataRequest } 
Apr 10 20:01:23 osmin kernel: ide3: reset: success 
Apr 10 20:01:24 osmin kernel: hdg: write_intr error2: nr_sectors=1, stat=0x58 
Apr 10 20:01:24 osmin kernel: hdg: write_intr: status=0x58 { DriveReady SeekComplete DataRequest } 

which can go on for quite some time:

Apr 10 20:32:20 osmin kernel: ide3: reset: success 
Apr 10 20:37:47 osmin kernel: hdg: write_intr error2: nr_sectors=1, stat=0x58 
Apr 10 20:37:47 osmin kernel: hdg: write_intr: status=0x58 { DriveReady SeekComplete DataRequest } 

and yet hdg1 is still in the array:

[root@osmin src]# cat /proc/mdstat 
Personalities : [raid5] 
read_ahead 1024 sectors
md1 : active raid5 hdd1[4] hdk1[3] hdi1[2] hdg1[1] hde1[0] 100251264 blocks level 5, 32k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>

My questions are:

1. How come the disk is hopeless without DMA? Is there something about UDMA
that I don't understand?

2. The disk seems to be "cured" by re-enabling DMA . . . but what state is my 
array likely to be in after the errors above? Can I safely assume this was 
harmless? I mean, they WERE write errors after all, yes? Is my array still in 
sync? Is there any way to tell, other than by unmounting the array and fscking? 
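(For what it's worth, here's how I've been re-enabling DMA after each dropout. I'm assuming hdparm is the right tool for this; the device name is my hdg from above:)

```shell
# Turn DMA back on for the drive the kernel gave up on (hdg in my case),
# then read the setting back to confirm it took.
hdparm -d1 /dev/hdg   # -d1 re-enables DMA for the drive
hdparm -d  /dev/hdg   # should report "using_dma = 1 (on)"
```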

3. Is the failure simply not severe enough to trigger removal from the 
array and hot reconstruction onto the hot spare that is available?

4. Is there some way to mark this disk bad right now, so that reconstruction 
is carried out from the disks I trust? I do have a hot spare . . . 
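To frame the question: with the raidtools here, I'd guess the sequence would look something like the sketch below, but I haven't dared try it on a live array (command names are my reading of the raidtools-0.90 docs; /dev/md1 and /dev/hdg1 are from my setup above):

```shell
# Mark hdg1 as failed so md stops trusting it...
raidsetfaulty /dev/md1 /dev/hdg1
# ...remove it from the array...
raidhotremove /dev/md1 /dev/hdg1
# ...and watch reconstruction proceed onto the hot spare:
cat /proc/mdstat
```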

Thanks in advance for any advice.

-Darren
