I have found and fixed several bugs in the raid1 code relating to handling failed disks. First, I'll show the code in question and discuss each failure (as I understand them, which may be wrong). Then I'll discuss my take on fixing the bugs. Last, I'll include a patch to implement my fixes.
*******************************************************************************
	int last_used = conf->last_used;
	/*
	 * read balancing logic:
	 */
	mirror = conf->mirrors + last_used;
	bh->b_rdev = mirror->dev;
	sectors = bh->b_size >> 9;
First, the use of last_used in the read balancing code. The subroutine raid1_make_request uses last_used as both the last and the next target for a read. On entry, last_used is assumed to be valid and a read is issued to that target. If something has happened (such as a device failure), another read is still issued to that drive, because the drive failure code does not adjust last_used.
*******************************************************************************
	if (switch_disks && !conf->resync_mirrors) {
		conf->sect_count = 0;
		last_used = conf->last_used = mirror->next;
		/*
		 * Do not switch to write-only disks ...
		 * reconstruction is in progress
		 */
		while (conf->mirrors[last_used].write_only)
			conf->last_used = conf->mirrors[last_used].next;
	}
Second, also in raid1_make_request: after a read is issued, last_used may be changed if enough reads have already gone to the one device. The code which chooses the new last_used walks a linked list of mirrored drives, but makes no check that any drive in that list is operational. Thus if one of the drives has an error and becomes inoperable, we will still fire reads at it, because the drive failure code does not adjust the linked list either.
Third, the code to switch drives has a programming error. Note that the local variable last_used is what the while condition tests, but inside the loop only conf->last_used is changed. That is an infinite loop: the value being tested can never change, so the loop can never terminate. I managed to hit this 3 times and I'm still not sure how.
*******************************************************************************
	if (conf->working_disks == 1) {
		/*
		 * Uh oh, we can do nothing if this is our last disk, but
		 * first check if this is a queued request for a device
		 * which has just failed.
		 */
		for (i = 0; i < disks; i++) {
			if (mirrors[i].dev==dev && !mirrors[i].operational)
				return 0;
		}
		printk (LAST_DISK);
	}
Fourth, in raid1_error there is no code to handle a drive failing while it is being hot-synced. Either the drive is left online despite the errors, or working_disks gets decremented incorrectly.
-----------------------------------
Below is my attempt at fixing these problems. The line numbers are not exactly like a clean 2.2.16 build because I tend to remove a lot of the extra printks from my code, and I'm too lazy to get a clean kernel and clean raid patch just to make the patch.
BTW, I'm not subscribed to this list, so any questions or comments will have to CC me directly.
Clay Jones
--- raid1.c 2000/09/09 12:25:46 1.6
+++ raid1.c 2000/10/14 22:22:14
@@ -250,6 +250,9 @@
 	if (rw==READ) {
 		int last_used = conf->last_used;
+		while (!conf->mirrors[last_used].operational || conf->mirrors[last_used].write_only)
+			last_used = conf->last_used = conf->mirrors[last_used].next;
+
 		/*
 		 * read balancing logic:
 		 */
@@ -274,8 +277,8 @@
 		 * Do not switch to write-only disks ...
 		 * reconstruction is in progress
 		 */
-		while (conf->mirrors[last_used].write_only)
-			conf->last_used = conf->mirrors[last_used].next;
+		while (!conf->mirrors[last_used].operational || conf->mirrors[last_used].write_only)
+			last_used = conf->last_used = conf->mirrors[last_used].next;
 	}
 	bh_req = &r1_bh->bh_req;
 	memcpy(bh_req, bh, sizeof(*bh));
@@ -431,30 +434,50 @@
 {
 	raid1_conf_t *conf = mddev_to_conf(mddev);
 	struct mirror_info * mirrors = conf->mirrors;
+	mdp_super_t *sb = mddev->sb;
 	int disks = MD_SB_DISKS;
 	int i;
-	if (conf->working_disks == 1) {
+	/* find the device */
+	for (i = 0; i < disks; i++) {
+		if (mirrors[i].dev==dev)
+			break;
+	}
+
+	/* if we can't find the device or the device isn't operational, then no harm, no foul */
+	if(i == disks || !mirrors[i].operational)
+		return 0;
+
+#if 0
+	if(mirrors[i].write_only)
+	{
+		mirrors[i].operational = 0;
+		mirrors[i].write_only = 0;
+		conf->spare = NULL;
+		mark_disk_faulty(sb->disks+mirrors[i].number);
+		mark_disk_nonsync(sb->disks+mirrors[i].number);
+		mark_disk_inactive(sb->disks+mirrors[i].number);
+		sb->spare_disks--;
+		sb->working_disks--;
+		sb->failed_disks++;
+
+		return -EIO;
+	}
+#endif
+
+	if (conf->working_disks == 1)
+	{
 		/*
 		 * Uh oh, we can do nothing if this is our last disk, but
 		 * first check if this is a queued request for a device
 		 * which has just failed.
 		 */
-		for (i = 0; i < disks; i++) {
-			if (mirrors[i].dev==dev && !mirrors[i].operational)
-				return 0;
-		}
 		printk (LAST_DISK);
 	} else {
 		/*
 		 * Mark disk as unusable
 		 */
-		for (i = 0; i < disks; i++) {
-			if (mirrors[i].dev==dev && mirrors[i].operational) {
-				mark_disk_bad (mddev, i);
-				break;
-			}
-		}
+		mark_disk_bad (mddev, i);
 	}
 	return 0;
 }
