I have been putting raid recovery (raid 1 and 5) through its paces by
marking various partitions as failed (using raidsetfaulty).

Then I do raidhotremove and raidhotadd to either restart the same
partition or add a new partition.

I came across a condition that caused a kernel oops (a call to address 0)
on 3 out of 4 tries:

Create a raid 1
fail one partition
remove it
add a partition on an IDE hard disk.
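
In case it helps anyone reproduce this: the fail/remove/add steps boil
down to the md ioctls, roughly as sketched below.  This is not what I
actually ran -- I drove everything with the raidtools commands above --
and the device names are only examples.

/*
 * Rough sketch of the fail/remove/add sequence in terms of the md
 * ioctls (SET_DISK_FAULTY, HOT_REMOVE_DISK, HOT_ADD_DISK), assuming a
 * raid 1 on /dev/md0 with /dev/sdb1 as the member being failed and
 * /dev/hdc1 as the IDE partition being added.  All names are examples.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <linux/major.h>         /* MD_MAJOR, used by the ioctl numbers */
#include <linux/raid/md_u.h>     /* SET_DISK_FAULTY, HOT_*_DISK */

int main(void)
{
    struct stat failed_part, new_part;
    int md = open("/dev/md0", O_RDONLY);          /* the raid 1 array */

    if (md < 0 ||
        stat("/dev/sdb1", &failed_part) < 0 ||    /* member to fail */
        stat("/dev/hdc1", &new_part) < 0) {       /* IDE partition to add */
        perror("open/stat");
        return 1;
    }

    /* mark the member faulty, then pull it out of the array */
    if (ioctl(md, SET_DISK_FAULTY, (unsigned long)failed_part.st_rdev) < 0)
        perror("SET_DISK_FAULTY");
    if (ioctl(md, HOT_REMOVE_DISK, (unsigned long)failed_part.st_rdev) < 0)
        perror("HOT_REMOVE_DISK");

    /* add the IDE partition; this starts the resync that oopses */
    if (ioctl(md, HOT_ADD_DISK, (unsigned long)new_part.st_rdev) < 0)
        perror("HOT_ADD_DISK");

    close(md);
    return 0;
}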

I could not get a failure when adding a partition on a SCSI device, nor
have I seen a failure on a raid 5.

There are no glaring differences between HOT_ADD on raid 1 and 5.

After examining the oops, I determined that it was always happening in
md_do_sync when a call was being made to run_task_queue(&tq_disk).

I struggled for some time to understand the reason for calling
run_task_queue at this point and decided it must be for speed.  I
commented out both run_task_queue calls and recompiled.  The problem
seems to have gone away, with no measurable decrease in performance.
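
For anyone who wants to look at the same spot, the resync loop has
roughly this shape.  This is a simplified sketch from my reading of the
code, not the actual md.c source, and the details are approximate:

static void md_do_sync_sketch(mddev_t *mddev, unsigned long max_blocks)
{
        unsigned long j = 0;

        while (j < max_blocks) {
                /* the personality (raid1/raid5) queues the next chunk
                 * of resync I/O */
                int blocks = mddev->pers->sync_request(mddev, j);

                if (blocks <= 0)
                        break;
                j += blocks;

                /* flush the queued requests down to the low-level
                 * drivers right away, apparently so the resync I/O does
                 * not sit on tq_disk waiting for someone else to flush
                 * it -- calls like this are where the oops pointed and
                 * are what I commented out (there are two call sites) */
                run_task_queue(&tq_disk);

                /* speed throttling and rescheduling omitted */
        }
}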

I see that run_task_queue has re-entrance protection; otherwise I would
suspect an interrupt servicing the same task queue at the same time.
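
The protection I mean looks roughly like this (again a simplified
sketch of the task-queue code from memory, not the actual source).  The
pending list is detached atomically before any routine runs, so a
re-entrant call from an interrupt would find an empty queue and return:

static inline void run_task_queue_sketch(task_queue *list)
{
        struct tq_struct *p;
        unsigned long flags;

        if (!*list)
                return;

        /* detach the whole pending list under the lock; from here on a
         * nested caller (e.g. an interrupt) sees an empty queue */
        spin_lock_irqsave(&tqueue_lock, flags);
        p = *list;
        *list = NULL;
        spin_unlock_irqrestore(&tqueue_lock, flags);

        while (p) {
                struct tq_struct *next = p->next;
                void (*routine)(void *) = p->routine;
                void *data = p->data;

                p->sync = 0;    /* entry may be re-queued from here on */
                if (routine)
                        routine(data);
                p = next;
        }
}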

Either way, I am running without the calls to run_task_queue and will
continue testing for other failures in the recovery path.

One final note.  After the oops, I would cycle power on the computer.
When it came back up, raid 1 considered the array to be in sync, even
though the resync had never completed.  I have found the area where the
superblock is updated, and I will check whether the SB is being updated
correctly.
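
Something along the following lines is how I plan to look at what the
on-disk superblock claims after one of these power cycles.  It is only
a rough userspace sketch, assuming the 0.90 persistent superblock
format and its usual 64K-aligned location near the end of the member
partition; /dev/hdc1 is just an example device:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>            /* BLKGETSIZE */
#include <linux/raid/md_p.h>     /* mdp_super_t, MD_SB_MAGIC, MD_SB_CLEAN */

int main(void)
{
    mdp_super_t sb;
    unsigned long sectors;
    long long offset;
    int fd = open("/dev/hdc1", O_RDONLY);

    if (fd < 0 || ioctl(fd, BLKGETSIZE, &sectors) < 0) {
        perror("open/BLKGETSIZE");
        return 1;
    }

    /* the 0.90 superblock sits MD_RESERVED_SECTORS below the last
     * 64K-aligned boundary of the device */
    offset = (long long)MD_NEW_SIZE_SECTORS(sectors) * 512;
    if (pread(fd, &sb, sizeof(sb), offset) != sizeof(sb)) {
        perror("pread");
        return 1;
    }
    if (sb.md_magic != MD_SB_MAGIC) {
        fprintf(stderr, "no md superblock found\n");
        return 1;
    }

    printf("state: %s  events: %u/%u  active/working/failed/spare: %u/%u/%u/%u\n",
           (sb.state & (1 << MD_SB_CLEAN)) ? "clean" : "dirty",
           sb.events_hi, sb.events_lo,
           sb.active_disks, sb.working_disks, sb.failed_disks, sb.spare_disks);

    close(fd);
    return 0;
}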

Clay
