software raid5 oops

2005-07-11 Thread Farkas Levente

does anybody have any useful tips about this?
yours.

 Original Message 
hi,
after we switched our servers from centos-3 to centos-4 (aka rhel-4), one
of our servers crashes once a week without any oops. this happens
with both the normal kernel-2.6.9-11.EL and
kernel-2.6.9-11.106.unsupported. we replaced the motherboard, the
raid controller and the cables too, and it still crashes. finally we set up
netdump, and yesterday we got a crash log and a core
file. it seems there is a bug in the raid5 code of the kernel.
this is our backup server with 8 x 200GB hdd in a raid5 (for the data)
plus 2 x 40GB hdd in raid1 (for the system) on a 3ware 8xxx raid
controller. i attached the netdump log of the last crash.
how can i fix it?
yours.


--
  Levente   Si vis pacem para bellum!
RAID5 conf printout:
 --- rd:8 wd:8 fd:0
 disk 0, o:1, dev:sda1
 disk 1, o:1, dev:sdb1
 disk 2, o:1, dev:sdc1
 disk 3, o:1, dev:sdd1
 disk 4, o:1, dev:sde1
 disk 5, o:1, dev:sdf1
 disk 6, o:1, dev:sdg1
 disk 7, o:1, dev:sdh1
Unable to handle kernel NULL pointer dereference at virtual address 
 printing eip:

*pde = 0f94a067
Oops:  [#1]
Modules linked in: cifs nls_utf8 ncpfs nfsd exportfs lockd sunrpc parport_pc lp 
parport netconsole netdump i2c_dev i2c_core ipx dm_mod e1000 tg3 floppy ext3 
jbd raid5 xor raid1 3w_ sd_mod scsi_mod
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010246   (2.6.9-11.106.unsupported) 
EIP is at 0x0
eax: c1806138   ebx: c018961c   ecx: 0016   edx: c035c7f4
esi: e7182200   edi: 0001   ebp: c18fb380   esp: f7878f34
ds: 007b   es: 007b   ss: 0068
Process md2_raid5 (pid: 224, threadinfo=f7878000 task=f7872600)
Stack: f7b973c0 f8879a26  md_thread+0x20d/0x23a
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c030ce1a] ret_from_fork+0x6/0x14
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c02a183f] md_thread+0x0/0x23a
 [c01041d9] kernel_thread_helper+0x5/0xb
Code:  Bad EIP value.

Pid: 224, comm:md2_raid5
EIP: 0060:[] CPU: 0
EIP is at 0x0
 EFLAGS: 00010246Not tainted  (2.6.9-11.106.unsupported)
EAX: c1806138 EBX: c018961c ECX: 0016 EDX: c035c7f4
ESI: e7182200 EDI: 0001 EBP: c18fb380 DS: 007b ES: 007b
CR0: 8005003b CR2: ffd5 CR3: 0fd6b000 CR4: 06d0
 [f8879a26] handle_stripe+0xfca/0x1207 [raid5]
 [f887a7d5] raid5d+0x197/0x2ab [raid5]
 [c02a1a4c] md_thread+0x20d/0x23a
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c030ce1a] ret_from_fork+0x6/0x14
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c02a183f] md_thread+0x0/0x23a
 [c01041d9] kernel_thread_helper+0x5/0xb

   sibling
  task PC  pid father child younger older
init  S C01458E9   920 1  0 2   (NOTLB)
f7f44eb0 0086 0055  xfrm_state_flush+0x2/0x289 tcp_poll+0x31/0x144
 [c01768e1] do_select+0x347/0x378
 [c0176461] __pollwait+0x0/0x94
 [c0176c05] sys_select+0x2e0/0x43a
 [c030cefb] syscall_call+0x7/0xb
ntpd  S 00D0  2516  2196  1  2219  2172 (NOTLB)
f0885eb0 0082 0246 00d0 cf8553a0 21cd 5197434f 3abd 
   f697d2a0 f697d42c f6a0d580 7fff  f0885f74 c030b7e5 f69a5980 
    f0885f58 f69a5980 f106ed18 c017648e 0246 f106b800 f0885f58 
Call Trace:
 [c030b7e5] schedule_timeout+0x50/0x10c
 [c017648e] __pollwait+0x2d/0x94
 [c02aeeac] datagram_poll+0x25/0xd1
 [c01768e1] do_select+0x347/0x378
 [c0176461] __pollwait+0x0/0x94
 [c0176c05] sys_select+0x2e0/0x43a
 [c01058d8] sys_sigreturn+0x1ce/0x1f2
 [c030cefb] syscall_call+0x7/0xb
rpc.rquotad   S   3416  2219  1  2223  2196 (NOTLB)
f65a4f1c 0082 0001  f697ccd0 00030ee2 966e2a43 0033 
   f69b1320 f69b14ac  7fff f65a4fa0 f0888ba0 c030b7e5 f106b580 
   f65a4fa0 f106e518 c02aeeac f6a7a780 c03806c0 0145 f0888bb0 0001 
Call Trace:
 [c030b7e5] schedule_timeout+0x50/0x10c
 [c02aeeac] datagram_poll+0x25/0xd1
 [c02a8f2c] sock_poll+0x12/0x14
 [c0176db3] do_pollfd+0x54/0x77
 [c0176e63] do_poll+0x8d/0xab
 [c0177020] sys_poll+0x19f/0x24f
 [c0176461] __pollwait+0x0/0x94
 [c030cefb] syscall_call+0x7/0xb
nfsd  S FF4DCFB0  2316  2223  1  2224  2219 (L-TLB)
f05fff10 0046 0002 ff4dcfb0 f69c4dd0 13cc c85b7bd5 37c8 
   f69b0d50 f69b0edc 03db8cae 03db8cae 000b c1993c00 c030b886 f5675f18 
   c035b0d0 03db8cae 1d244b3c  0005 c031a0b5 c031c25c 00a8 
Call Trace:
 [c030b886] schedule_timeout+0xf1/0x10c
 [c0129336] process_timeout+0x0/0x5
 [f8ade6bb] svc_recv+0x325/0x65b [sunrpc]
 [c011b856] default_wake_function+0x0/0xc
 [c011b921] __wake_up+0x6e/0xca
 [c011b856] default_wake_function+0x0/0xc
 [c012d587] sigprocmask+0x140/0x1f4
 [f8b2e44d] nfsd+0x1ae/0x540 [nfsd]
 [f8b2e29f] nfsd+0x0/0x540 [nfsd]
 [c01041d9] kernel_thread_helper+0x5/0xb
nfsd  S 37C3  3472  2224  1  2225  2223 (L-TLB)
f1513f10 

Re: software raid5 oops

2005-07-11 Thread Paul Clements

Farkas Levente wrote:

anybody has any useful tip about it?



Unable to handle kernel NULL pointer dereference at virtual address 
 printing eip:

*pde = 0f94a067
Oops:  [#1]
Modules linked in: cifs nls_utf8 ncpfs nfsd exportfs lockd sunrpc parport_pc lp 
parport netconsole netdump i2c_dev i2c_core ipx dm_mod e1000 tg3 floppy ext3 
jbd raid5 xor raid1 3w_ sd_mod scsi_mod
CPU:0
EIP:0060:[]Not tainted VLI
EFLAGS: 00010246   (2.6.9-11.106.unsupported) 
EIP is at 0x0

eax: c1806138   ebx: c018961c   ecx: 0016   edx: c035c7f4
esi: e7182200   edi: 0001   ebp: c18fb380   esp: f7878f34
ds: 007b   es: 007b   ss: 0068
Process md2_raid5 (pid: 224, threadinfo=f7878000 task=f7872600)
Stack: f7b973c0 f8879a26  md_thread+0x20d/0x23a
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c030ce1a] ret_from_fork+0x6/0x14
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c02a183f] md_thread+0x0/0x23a
 [c01041d9] kernel_thread_helper+0x5/0xb
Code:  Bad EIP value.

Pid: 224, comm:md2_raid5
EIP: 0060:[] CPU: 0
EIP is at 0x0
 EFLAGS: 00010246Not tainted  (2.6.9-11.106.unsupported)
EAX: c1806138 EBX: c018961c ECX: 0016 EDX: c035c7f4
ESI: e7182200 EDI: 0001 EBP: c18fb380 DS: 007b ES: 007b
CR0: 8005003b CR2: ffd5 CR3: 0fd6b000 CR4: 06d0
 [f8879a26] handle_stripe+0xfca/0x1207 [raid5]
 [f887a7d5] raid5d+0x197/0x2ab [raid5]
 [c02a1a4c] md_thread+0x20d/0x23a
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c030ce1a] ret_from_fork+0x6/0x14
 [c011ceaf] autoremove_wake_function+0x0/0x2d
 [c02a183f] md_thread+0x0/0x23a
 [c01041d9] kernel_thread_helper+0x5/0xb


We're in handle_stripe with an EIP of 0, which suggests a call through a
NULL function pointer. Perhaps a NULL bi_end_io in the following:


(raid5.c, line ~1252):

	while ((bi=return_bi)) {
		int bytes = bi->bi_size;

		return_bi = bi->bi_next;
		bi->bi_next = NULL;
		bi->bi_size = 0;
		bi->bi_end_io(bi, bytes, 0);
	}


Is it valid to assume that bi_end_io is non-NULL in this context?
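
To make the suspicion concrete, here is a minimal user-space sketch of that completion loop with a NULL check added. It is only a model, not a proposed kernel patch: struct mock_bio, count_end_io, complete_bios and completed_bytes are hypothetical names invented for illustration, and only the fields the loop touches are modelled. Without the check, the call on a bio whose bi_end_io is NULL would jump to address 0, which would match the "EIP is at 0x0" in the oops.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct bio (assumption:
 * only the three fields used by the loop are modelled). */
struct mock_bio {
	struct mock_bio *bi_next;
	unsigned int bi_size;
	int (*bi_end_io)(struct mock_bio *, unsigned int, int);
};

/* Total bytes reported to the example completion handler. */
static unsigned int completed_bytes;

/* Example completion handler: just records how many bytes finished. */
static int count_end_io(struct mock_bio *bi, unsigned int bytes, int error)
{
	(void)bi;
	(void)error;
	completed_bytes += bytes;
	return 0;
}

/* Same walk as the quoted loop, but a bio with a NULL bi_end_io is
 * reported and skipped instead of being called through.
 * Returns the number of bios that had no completion handler. */
static int complete_bios(struct mock_bio *return_bi)
{
	struct mock_bio *bi;
	int skipped = 0;

	while ((bi = return_bi)) {
		unsigned int bytes = bi->bi_size;

		return_bi = bi->bi_next;
		bi->bi_next = NULL;
		bi->bi_size = 0;
		if (!bi->bi_end_io) {
			/* Calling here would transfer control to address 0. */
			fprintf(stderr, "bio %p has a NULL bi_end_io\n", (void *)bi);
			skipped++;
			continue;
		}
		bi->bi_end_io(bi, bytes, 0);
	}
	return skipped;
}
```

Calling complete_bios() on a two-entry list where only the first bio has a handler should return 1 and leave completed_bytes at the first bio's size. In the real driver a check like this would only hide the symptom; the open question is still how a bio without an end_io function could end up on the return list.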

--
Paul