    Vu> Here is my status of testing this patch.  On an x86-64 system
    Vu> I got a data corruption problem reported after ~4 hrs of
    Vu> running Engenio's Smash test tool when I tested with Engenio
    Vu> storage.  On an ia64 system I got multiple async event 3
    Vu> (IB_EVENT_QP_ACCESS_ERR) and event 1 (IB_EVENT_QP_FATAL)
    Vu> notifications; finally the error handling path kicked in and
    Vu> the system panicked.  Please see the log below (I tested with
    Vu> Mellanox's SRP target reference implementation - I don't see
    Vu> this error without the patch).
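
For reference, those event numbers correspond to enum ib_event_type
in the verbs header, where IB_EVENT_QP_FATAL = 1 and
IB_EVENT_QP_ACCESS_ERR = 3.  They are delivered to the per-QP
callback a consumer installs through ib_qp_init_attr.event_handler at
QP creation time.  A minimal sketch of such a callback, using
illustrative names rather than the SRP driver's actual handler:

    #include <rdma/ib_verbs.h>      /* <ib_verbs.h> in older trees */

    /* Illustrative per-QP async event callback; installed via
     * ib_qp_init_attr.event_handler when the QP is created. */
    static void example_qp_event(struct ib_event *event, void *context)
    {
            switch (event->event) {
            case IB_EVENT_QP_FATAL:         /* "async event 1" above */
                    printk(KERN_ERR "QP fatal error on QP %p\n",
                           event->element.qp);
                    break;
            case IB_EVENT_QP_ACCESS_ERR:    /* "async event 3" above */
                    printk(KERN_ERR "QP access error on QP %p\n",
                           event->element.qp);
                    break;
            default:
                    printk(KERN_ERR "unhandled QP event %d\n",
                           event->event);
                    break;
            }
    }

IB_EVENT_QP_ACCESS_ERR generally means a request violated the QP's
access rights, e.g. an RDMA operation arrived with an invalid or
already-invalidated R_Key, which is why a stale memory mapping is a
plausible suspect here.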

Hmm, that's interesting.  Did you see this type of problem with the
original FMR patch you wrote (and did you do this level of stress
testing)?  I'm wondering whether the issue is in the SRP driver, or
whether there is a bug in the FMR stuff at a lower level.
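
To make "the FMR stuff at a lower level" concrete: SRP maps each
request's pages through the shared FMR pool, roughly as in the sketch
below.  The signatures are from memory of ib_fmr_pool.h, so treat
them as approximate, and the function and variable names are
illustrative only:

    #include <rdma/ib_fmr_pool.h>

    /* Sketch: map a DMA page list through an FMR pool the way a ULP
     * such as SRP would, then return the mapping to the pool. */
    static int example_map_request(struct ib_fmr_pool *pool,
                                   u64 *dma_pages, int npages, u64 iova)
    {
            struct ib_pool_fmr *fmr;

            /* Map npages pages starting at iova under a single R_Key. */
            fmr = ib_fmr_pool_map_phys(pool, dma_pages, npages, iova);
            if (IS_ERR(fmr))
                    return PTR_ERR(fmr);

            /* ... describe the buffer to the target using
             * fmr->fmr->rkey and post the request ... */

            /* Return the FMR to the pool.  Invalidation is deferred
             * until enough dirty FMRs accumulate to trigger a flush,
             * so a bug in that deferral could leave a stale mapping
             * visible on the wire. */
            return ib_fmr_pool_unmap(fmr);
    }

If the pool hands back a cached mapping it shouldn't, or delays
invalidation incorrectly, the target could DMA through a stale
translation; that would show up as either silent data corruption or a
QP access error, matching both failure modes described above.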


I stress-tested on x86_64 and did not see the data corruption
problem.  I restarted the test with your patch and it has run without
any problem so far (~15 hrs).

When I tested with my original patch on ia64 I hit a different
problem (an oops in __copy_user, called from unmap_single; see the
trace below):


swapper[0]: Oops 8813272891392 [1]
Modules linked in: ib_srp ib_sa ib_cm ib_umad evdev joydev sg st sr_mod ide_cd cdrom usbserial parport_pc lp parport thermal processor ipv6 fan button ib_mthca ib_mad ib_core bd

Pid: 0, CPU 0, comm:              swapper
psr : 0000101008022038 ifs : 8000000000000003 ip : [<a0000001002f68f0>] Not tainted
ip is at __copy_user+0x890/0x960
unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003
rnat: e0000001fd1cbb64 bsps: a0000001008e9ef8 pr  : 80000000a96627a7
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001003019f0 b6  : a000000100003320 b7  : a000000100302120
f6  : 000000000000000000000 f7  : 1003eff23971ce39d6000
f8  : 1003ef840500400886000 f9  : 100068000000000000000
f10 : 10005fffffffff0000000 f11 : 1003e0000000000000080
r1  : a000000100ae8b50 r2  : 0d30315052534249 r3  : 0d3031505253424a
r8  : a000000100902570 r9  : 2d3031504db9c249 r10 : 0000000000544f53
r11 : e000000004998000 r12 : a0000001007bfb20 r13 : a0000001007b8000
r14 : a0007ffffdc00000 r15 : a000000100902540 r16 : a000000100902570
r17 : 0000000000000000 r18 : ffffffffffffffff r19 : e5c738e7c46c654d
r20 : e5c738e758000000 r21 : ff23971ce39d6000 r22 : c202802004430000
r23 : e0000001e2fafd78 r24 : 6203002002030000 r25 : e0000001e6fec18b
r26 : ffffffffffffff80 r27 : 0000000000000000 r28 : 0d30315052534000
r29 : 0000000000000001 r30 : ffffffffffffffff r31 : a0000001007480c8

Call Trace:
 [<a0000001000136a0>] show_stack+0x80/0xa0
                                sp=a0000001007bf6a0 bsp=a0000001007b94c0
 [<a000000100013f00>] show_regs+0x840/0x880
                                sp=a0000001007bf870 bsp=a0000001007b9460
 [<a000000100036fd0>] die+0x1b0/0x240
                                sp=a0000001007bf880 bsp=a0000001007b9418
 [<a00000010005a770>] ia64_do_page_fault+0x970/0xae0
                                sp=a0000001007bf8a0 bsp=a0000001007b93a8
 [<a00000010000be60>] ia64_leave_kernel+0x0/0x280
                                sp=a0000001007bf950 bsp=a0000001007b93a8
 [<a0000001002f68f0>] __copy_user+0x890/0x960
                                sp=a0000001007bfb20 bsp=a0000001007b9390
 [<a0000001003019f0>] unmap_single+0x90/0x2a0
                                sp=a0000001007bfb20 bsp=a0000001007b9388
 [<a0000001007bf960>] init_task+0x7960/0x8000
                                sp=a0000001007bfb20 bsp=a0000001007b90e0
 [<a0000001003019f0>] unmap_single+0x90/0x2a0
                                sp=a0000001007bfb20 bsp=a0000001007b8e38

What kind of HCAs were you using?  I assume on ia64 you're using
PCI-X; what about on x86-64?  PCIe or not?  MemFree or not?


PCI-X on ia64, and PCIe MemFree (no on-board memory) on x86_64.

Another thing that might be useful if it's convenient for you would be
to use an IB analyzer and trigger on a NAK to see what happens on the
wire around the IB_EVENT_QP_ACCESS_ERR.

I'll capture some logs with an analyzer when one is available.

Vu