[dpdk-dev] Kernel panic in KNI

2016-04-07 Thread Matt Laswell
Hey Robert,

Thanks for the insight.  I work with Jay on the code he's asking about; we
only have one mbuf pool that we use for all packets.  Mostly, this is for
the reasons that you describe, as well as for the sake of simplicity.  As
it happens, the stack trace we're seeing makes it look as though either the
mbuf's data pointer is screwed up, or the VA translation done on it is.  I
suspect that we're getting to a failure mode similar to the one you
experienced, though perhaps for different reasons.

Thanks,
Matt

On Wed, Apr 6, 2016 at 5:30 PM, Sanford, Robert  wrote:

> Hi Jay,
>
> I won't try to interpret your kernel stack trace. But, I'll tell you about
> a KNI-related problem that we once experienced, and the symptom was a
> kernel hang.
>
> The problem was that we were passing mbufs allocated out of one mempool,
> to a KNI context that we had set up with a different mempool (on a
> different CPU socket). The KNI kernel driver, converts the user-space mbuf
> virtual address (VA) to a kernel VA by adding the difference between the
> user and kernel VAs of the mempool used to create the KNI context. So, if
> an mbuf comes from a different mempool, the calculated address will
> probably be VERY BAD.
>
> Could this be your problem?
>
> --
> Robert
>
>
> On 4/6/16 4:16 PM, "Jay Rolette"  wrote:
>
> >I had a system lockup hard a couple of days ago and all we were able to
> >get
> >was a photo of the LCD monitor with most of the kernel panic on it. No way
> >to scroll back the buffer and nothing in the logs after we rebooted. Not
> >surprising with a kernel panic due to an exception during interrupt
> >processing. We have a serial console attached in case we are able to get
> >it
> >to happen again, but it's not easy to reproduce (hours of runtime for this
> >instance).
> >
> >Ran the photo through OCR software to get a text version of the dump, so
> >possible I missed some fixups in this:
> >
> >[39178.433262] RDX: 00ba RSI: 881fd2f350ee RDI:
> >a12520669126180a
> >[39178.464020] RBP: 880433966970 R08: a12520669126180a R09:
> >881fd2f35000
> >[39178.495091] R10:  R11: 881fd2f88000 R12:
> >883fdla75ee8
> >[39178.526594] R13: 00ba R14: 7fdad5a66780 R15:
> >883715ab6780
> >[39178.559011] FS:  77fea740() GS:88lfffc0()
> >knlGS:
> >[39178.592005] CS:  0010 DS:  ES:  CR0: 80050033
> >[39178.623931] CR2: 77ea2000 CR3: 001fd156f000 CR4:
> >001407f0
> >[39178.656187] Stack:
> >[39178.689025] c067c7ef 00ba 00ba
> >881fd2f88000
> >[39178.722682] 4000 8B3fd0bbd09c 883fdla75ee8
> >8804339bb9c8
> >[39178.756525] 81658456 881fcd2ec40c c0680700
> >880436bad800
> >[39178.790577] Call Trace:
> >[39178.824420] [] ? kni_net_tx+0xef/0x1a0 [rte_kni]
> >[39178.859190] [] dev_hard_start_xmit+0x316/0x5c0
> >[39178.893426] [] sch_direct_xmit+0xee/0xic0
> >[39178.927435] [l __dev_queue_xmit+0x200/0x4d0
> >[39178.961684] [l dev_queue_xmit+0x10/0x20
> >[39178.996194] [] neigh_connected_output+0x67/0x100
> >[39179.031098] [] ip_finish_output+0xid8/0x850
> >[39179.066709] [l ip_output+0x58/0x90
> >[39179.101551] [] ip_local_out_sk+0x30/0x40
> >[39179.136823] [] ip_queue_xmit+0xl3f/0x3d0
> >[39179.171742] [] tcp_transmit_skb+0x47c/0x900
> >[39179.206854] [l tcp_write_xmit+0x110/0xcb0
> >[39179.242335] [] __tcp_push_pending_frames+0x2e/0xc0
> >[39179.277632] [] tcp_push+0xec/0x120
> >[39179.311768] [] tcp_sendmsg+0xb9/0xce0
> >[39179.346934] [] ? tcp_recvmsg+0x6e2/0xba0
> >[39179.385586] [] inet_sendmsg+0x64/0x60
> >[39179.424228] [] ? apparmor_socket_sendmsg+0x21/0x30
> >[39179.4586581 [] sock_sendmsg+0x86/0xc0
> >[39179.493220] [] ? __inet_stream_connect+0xa5/0x320
> >[39179.528033] [] ? __fdget+0x13/0x20
> >[39179.561214] [] SYSC_sendto+0x121/0x1c0
> >[39179.594665] [] ? aa_sk_perm.isra.4+0x6d/0x150
> >[39179.6268931 [] ? read_tsc+0x9/0x20
> >[39179.6586541 [] ? ktime_get_ts+0x48/0xe0
> >[39179.689944] [] SyS_sendto+0xe/0x10
> >[39179.719575] [] system_call_fastpath+0xia/0xif
> >[39179.748760] Code: 43 58 48 Zb 43 50 88 43 4e 5b 5d c3 66 Of if 84 00 00
> >00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90
> > 90 48 89 f8 48 89 d1  a4 c3 03 83 eZ 07 f3 48 .15 89 di f3 a4 c3 20
> >4c
> >8b % 4c 86
> >[39179.808690] RIP  [] memcpy+0x6/0x110
> >[39179.837238]  RSP 
> >[39179.933755] ---[ end trace 2971562f425e2cf8 ]---
> >[39179.964856] Kernel panic - not syncing: Fatal exception in interrupt
> >[39179.992896] Kernel Offset: 0x0 from 0x8100 (relocation
> >range: 0x8000-0xbfff)
> >[39180.024617] ---[ end Kernel panic - not syncing: Fatal exception in
> >interrupt
> >
> >It blew up when kni_net_tx() called memcpy() to copy data from the skb to
> >an mbuf.
> >
> >Disclosure: I'm not a Linux device driver guy. I dip into the kernel as
> >needed. Plenty of experience doing RTOS and 

[dpdk-dev] Kernel panic in KNI

2016-04-06 Thread Sanford, Robert
Hi Jay,

I won't try to interpret your kernel stack trace. But, I'll tell you about
a KNI-related problem that we once experienced, and the symptom was a
kernel hang.

The problem was that we were passing mbufs allocated out of one mempool,
to a KNI context that we had set up with a different mempool (on a
different CPU socket). The KNI kernel driver, converts the user-space mbuf
virtual address (VA) to a kernel VA by adding the difference between the
user and kernel VAs of the mempool used to create the KNI context. So, if
an mbuf comes from a different mempool, the calculated address will
probably be VERY BAD.

Could this be your problem?

--
Robert


On 4/6/16 4:16 PM, "Jay Rolette"  wrote:

>I had a system lockup hard a couple of days ago and all we were able to
>get
>was a photo of the LCD monitor with most of the kernel panic on it. No way
>to scroll back the buffer and nothing in the logs after we rebooted. Not
>surprising with a kernel panic due to an exception during interrupt
>processing. We have a serial console attached in case we are able to get
>it
>to happen again, but it's not easy to reproduce (hours of runtime for this
>instance).
>
>Ran the photo through OCR software to get a text version of the dump, so
>possible I missed some fixups in this:
>
>[39178.433262] RDX: 00ba RSI: 881fd2f350ee RDI:
>a12520669126180a
>[39178.464020] RBP: 880433966970 R08: a12520669126180a R09:
>881fd2f35000
>[39178.495091] R10:  R11: 881fd2f88000 R12:
>883fdla75ee8
>[39178.526594] R13: 00ba R14: 7fdad5a66780 R15:
>883715ab6780
>[39178.559011] FS:  77fea740() GS:88lfffc0()
>knlGS:
>[39178.592005] CS:  0010 DS:  ES:  CR0: 80050033
>[39178.623931] CR2: 77ea2000 CR3: 001fd156f000 CR4:
>001407f0
>[39178.656187] Stack:
>[39178.689025] c067c7ef 00ba 00ba
>881fd2f88000
>[39178.722682] 4000 8B3fd0bbd09c 883fdla75ee8
>8804339bb9c8
>[39178.756525] 81658456 881fcd2ec40c c0680700
>880436bad800
>[39178.790577] Call Trace:
>[39178.824420] [] ? kni_net_tx+0xef/0x1a0 [rte_kni]
>[39178.859190] [] dev_hard_start_xmit+0x316/0x5c0
>[39178.893426] [] sch_direct_xmit+0xee/0xic0
>[39178.927435] [l __dev_queue_xmit+0x200/0x4d0
>[39178.961684] [l dev_queue_xmit+0x10/0x20
>[39178.996194] [] neigh_connected_output+0x67/0x100
>[39179.031098] [] ip_finish_output+0xid8/0x850
>[39179.066709] [l ip_output+0x58/0x90
>[39179.101551] [] ip_local_out_sk+0x30/0x40
>[39179.136823] [] ip_queue_xmit+0xl3f/0x3d0
>[39179.171742] [] tcp_transmit_skb+0x47c/0x900
>[39179.206854] [l tcp_write_xmit+0x110/0xcb0
>[39179.242335] [] __tcp_push_pending_frames+0x2e/0xc0
>[39179.277632] [] tcp_push+0xec/0x120
>[39179.311768] [] tcp_sendmsg+0xb9/0xce0
>[39179.346934] [] ? tcp_recvmsg+0x6e2/0xba0
>[39179.385586] [] inet_sendmsg+0x64/0x60
>[39179.424228] [] ? apparmor_socket_sendmsg+0x21/0x30
>[39179.4586581 [] sock_sendmsg+0x86/0xc0
>[39179.493220] [] ? __inet_stream_connect+0xa5/0x320
>[39179.528033] [] ? __fdget+0x13/0x20
>[39179.561214] [] SYSC_sendto+0x121/0x1c0
>[39179.594665] [] ? aa_sk_perm.isra.4+0x6d/0x150
>[39179.6268931 [] ? read_tsc+0x9/0x20
>[39179.6586541 [] ? ktime_get_ts+0x48/0xe0
>[39179.689944] [] SyS_sendto+0xe/0x10
>[39179.719575] [] system_call_fastpath+0xia/0xif
>[39179.748760] Code: 43 58 48 Zb 43 50 88 43 4e 5b 5d c3 66 Of if 84 00 00
>00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90
> 90 48 89 f8 48 89 d1  a4 c3 03 83 eZ 07 f3 48 .15 89 di f3 a4 c3 20
>4c
>8b % 4c 86
>[39179.808690] RIP  [] memcpy+0x6/0x110
>[39179.837238]  RSP 
>[39179.933755] ---[ end trace 2971562f425e2cf8 ]---
>[39179.964856] Kernel panic - not syncing: Fatal exception in interrupt
>[39179.992896] Kernel Offset: 0x0 from 0x8100 (relocation
>range: 0x8000-0xbfff)
>[39180.024617] ---[ end Kernel panic - not syncing: Fatal exception in
>interrupt
>
>It blew up when kni_net_tx() called memcpy() to copy data from the skb to
>an mbuf.
>
>Disclosure: I'm not a Linux device driver guy. I dip into the kernel as
>needed. Plenty of experience doing RTOS and bare metal development, but
>not
>a Linux kernel expert.
>
>What context does kni_net_tx() run in? On the receive path, my
>understanding is that KNI always runs in process context on a kthread.
>I've
>been assuming that the transmit path was also in process context (albeit
>on
>the app's process), so the "Fatal exception in interrupt" is throwing me.
>
>Does kni_net_tx() ever run in interrupt (or soft-interrupt) context?
>
>Thanks,
>Jay



[dpdk-dev] Kernel panic in KNI

2016-04-06 Thread Jay Rolette
I had a system lockup hard a couple of days ago and all we were able to get
was a photo of the LCD monitor with most of the kernel panic on it. No way
to scroll back the buffer and nothing in the logs after we rebooted. Not
surprising with a kernel panic due to an exception during interrupt
processing. We have a serial console attached in case we are able to get it
to happen again, but it's not easy to reproduce (hours of runtime for this
instance).

Ran the photo through OCR software to get a text version of the dump, so
possible I missed some fixups in this:

[39178.433262] RDX: 00ba RSI: 881fd2f350ee RDI:
a12520669126180a
[39178.464020] RBP: 880433966970 R08: a12520669126180a R09:
881fd2f35000
[39178.495091] R10:  R11: 881fd2f88000 R12:
883fdla75ee8
[39178.526594] R13: 00ba R14: 7fdad5a66780 R15:
883715ab6780
[39178.559011] FS:  77fea740() GS:88lfffc0()
knlGS:
[39178.592005] CS:  0010 DS:  ES:  CR0: 80050033
[39178.623931] CR2: 77ea2000 CR3: 001fd156f000 CR4:
001407f0
[39178.656187] Stack:
[39178.689025] c067c7ef 00ba 00ba
881fd2f88000
[39178.722682] 4000 8B3fd0bbd09c 883fdla75ee8
8804339bb9c8
[39178.756525] 81658456 881fcd2ec40c c0680700
880436bad800
[39178.790577] Call Trace:
[39178.824420] [] ? kni_net_tx+0xef/0x1a0 [rte_kni]
[39178.859190] [] dev_hard_start_xmit+0x316/0x5c0
[39178.893426] [] sch_direct_xmit+0xee/0xic0
[39178.927435] [l __dev_queue_xmit+0x200/0x4d0
[39178.961684] [l dev_queue_xmit+0x10/0x20
[39178.996194] [] neigh_connected_output+0x67/0x100
[39179.031098] [] ip_finish_output+0xid8/0x850
[39179.066709] [l ip_output+0x58/0x90
[39179.101551] [] ip_local_out_sk+0x30/0x40
[39179.136823] [] ip_queue_xmit+0xl3f/0x3d0
[39179.171742] [] tcp_transmit_skb+0x47c/0x900
[39179.206854] [l tcp_write_xmit+0x110/0xcb0
[39179.242335] [] __tcp_push_pending_frames+0x2e/0xc0
[39179.277632] [] tcp_push+0xec/0x120
[39179.311768] [] tcp_sendmsg+0xb9/0xce0
[39179.346934] [] ? tcp_recvmsg+0x6e2/0xba0
[39179.385586] [] inet_sendmsg+0x64/0x60
[39179.424228] [] ? apparmor_socket_sendmsg+0x21/0x30
[39179.4586581 [] sock_sendmsg+0x86/0xc0
[39179.493220] [] ? __inet_stream_connect+0xa5/0x320
[39179.528033] [] ? __fdget+0x13/0x20
[39179.561214] [] SYSC_sendto+0x121/0x1c0
[39179.594665] [] ? aa_sk_perm.isra.4+0x6d/0x150
[39179.6268931 [] ? read_tsc+0x9/0x20
[39179.6586541 [] ? ktime_get_ts+0x48/0xe0
[39179.689944] [] SyS_sendto+0xe/0x10
[39179.719575] [] system_call_fastpath+0xia/0xif
[39179.748760] Code: 43 58 48 Zb 43 50 88 43 4e 5b 5d c3 66 Of if 84 00 00
00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90
 90 48 89 f8 48 89 d1  a4 c3 03 83 eZ 07 f3 48 .15 89 di f3 a4 c3 20 4c
8b % 4c 86
[39179.808690] RIP  [] memcpy+0x6/0x110
[39179.837238]  RSP 
[39179.933755] ---[ end trace 2971562f425e2cf8 ]---
[39179.964856] Kernel panic - not syncing: Fatal exception in interrupt
[39179.992896] Kernel Offset: 0x0 from 0x8100 (relocation
range: 0x8000-0xbfff)
[39180.024617] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt

It blew up when kni_net_tx() called memcpy() to copy data from the skb to
an mbuf.

Disclosure: I'm not a Linux device driver guy. I dip into the kernel as
needed. Plenty of experience doing RTOS and bare metal development, but not
a Linux kernel expert.

What context does kni_net_tx() run in? On the receive path, my
understanding is that KNI always runs in process context on a kthread. I've
been assuming that the transmit path was also in process context (albeit on
the app's process), so the "Fatal exception in interrupt" is throwing me.

Does kni_net_tx() ever run in interrupt (or soft-interrupt) context?

Thanks,
Jay