I was running into occasional oops messages from the kernel with a RAID 1 set.
After some prolonged testing I was finally able to reproduce the problem reliably with a specific sequence of commands.
Consider a running RAID 1 made up of hdc1 and hdd1. The size of the partitions doesn't seem to matter (I created them at 100 MB each), and neither does IDE vs. SCSI.
raidsetfaulty /dev/md0 /dev/hdc1
raidhotremove /dev/md0 /dev/hdc1
raidhotadd /dev/md0 /dev/hdc1
In roughly 1 out of 3 tries I get an oops almost immediately, somewhere in the middle of md_do_sync().
If I reboot the computer it will recognize the re-added drive and resync it correctly.
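For repeated testing, the three-command sequence can be wrapped in a small loop. This is only a sketch assuming the raidtools commands shown above (raidsetfaulty, raidhotremove, raidhotadd); it defaults to a dry run that just prints the commands, so set RUN= (empty) and run it as root against a scratch array to actually exercise the bug.

```shell
#!/bin/sh
# Sketch of the failure-inducing sequence from the report above.
# Assumes the raidtools commands raidsetfaulty/raidhotremove/raidhotadd.
# RUN defaults to "echo" (dry run); set RUN= to execute for real as root.
MD=/dev/md0
DISK=/dev/hdc1
RUN=${RUN-echo}

cycle() {
    # fail the disk, pull it from the array, then re-add it,
    # which kicks off a resync via md_do_sync()
    $RUN raidsetfaulty "$MD" "$DISK"
    $RUN raidhotremove "$MD" "$DISK"
    $RUN raidhotadd "$MD" "$DISK"
}

# Three cycles were enough to hit the oops about once on my box.
for i in 1 2 3; do
    cycle
done
```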
If I create a RAID 5 from hdb1, hdc1, and hdd1 and run the same test, I do not get the oops.
Preliminary research seems to point at this call in md_do_sync():

    run_task_queue(&tq_disk);
Screen capture at the time of the oops:
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 100 KB/sec.
md: using maximum available idle IO bandwith for reconstruction.
md: using 128k window.
Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<00000000>]
EFLAGS: 00010002
eax: 00000000 ebx: 00000246 ecx: 00000001 edx: c0256cdc
esi: c364bde0 edi: 00000080 ebp: c3efbfd8 esp: c3efbf38
ds: 0018 es: 0018 ss: 0018
Process mdrecoveryd (pid: 6, process nr: 6, stackpage=c3efb000)
Stack: c018f032 c0256cdc c3508340 c3769000 c0225f24 c3efbfd8 c0109d75 00000001
c01099d8 00000000 000003fd 000003f9 c023ed30 00000036 00000001 00000000
c02251ec 00000000 00000900 00000036 00001000 00001000 008d54fb 00000000
Call Trace: [<c018f032>] [<c0109d75>] [<c01099d8>] [<c018f6ca>] [<c018e38f>] [<c0203208>] [<c01074e3>] [<c018f5d8>]
Code: Bad EIP value.
I am no longer on the linux-raid mailing list because I can't seem to make Outlook send plain-text messages. Please cc me on replies; otherwise I will watch the archives of the linux-raid list.
I need to fix this soon, so I will post anything else I find.
