On Tue, Jan 09, 2018 at 03:49:21PM +0100, Tobias Hommel wrote:
> On Tue, Jan 09, 2018 at 10:26:24AM +0100, Steffen Klassert wrote:
> > On Tue, Jan 09, 2018 at 10:06:51AM +0100, Tobias Hommel wrote:
> > > > 
> > > > You have CONFIG_INET_ESP_OFFLOAD enabled. This is new, so maybe it
> > > > still has some problems. You should not hit an offload codepath
> > > > because all your SAs are configured with UDP encapsulation, which
> > > > is still not supported with offload.
I ran some new tests with 4.14.12. This time I removed encap=yes from the
strongswan config, so I have plain ESP tunnels without UDP encapsulation, just
to be sure. It still crashes; the attached panic.noencap.log is pretty much
the same as the logs before.

> > > > 
> > > > Please try to disable GRO on both interfaces and see what happens:
> > > > 
> > > > ethtool -K eth0 gro off
> > > > ethtool -K eth1 gro off
> > > I actually already tried that with only eth1 off, to verify I turned offloading
> > > off for both interfaces. The same problem: see attached panic.gro_off.log
> > > 
> > > > 
> > > > Then disable CONFIG_INET_ESP_OFFLOAD and try again.
> > > Rebuilt with CONFIG_INET_ESP_OFFLOAD disabled, same problem: see attached
> > > panic.esp_offload_disabled.log
> > 
> > So ESP offload is not the problem. Next thing that comes to my mind
> > is the flowcache removal, this was introduced with v4.14.
> > 
> > > 
> > > > 
> > > > This should show us if this feature is responsible for the bug.
> > > > 
> > > 
> > > I will try narrowing down the problem by trying out some older kernels for now.
> > 
> > Thanks!
> > 
> > Let me know about the results.
> 
> I copied the config from my 4.14.12 sources to a fresh 4.13.16 source tree, ran
> `make olddefconfig` and built a new kernel.
> The kernel config is attached as kernel-4.13.16.config.
> The panic*.log files are kernel logs from different crashes of this 4.13.16
> kernel, but all from the same scenario as before.
> I also enabled CONFIG_DEBUG_INFO, so if any disassemblies are required, I'd be
> happy to provide them.
> 
> So, the system still crashes, but the traces are completely different from
> those with 4.14.12. This time there are also WARNINGs and "refcnt: -1" messages
> sometimes before the actual panic, so not sure if there is maybe some other
> problem. Still, the crashes all seem to be related to ip routing somehow.
[ 2298.720212] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 2298.728193] IP: xfrm_lookup+0x2a/0x7d0
[ 2298.731986] PGD 0 P4D 0 
[ 2298.734535] Oops: 0000 [#1] SMP PTI
[ 2298.738035] Modules linked in:
[ 2298.741121] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 4.14.12 #3
[ 2298.747362] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
[ 2298.755136] task: ffffa0dafb08dc00 task.stack: ffffa211c0040000
[ 2298.761091] RIP: 0010:xfrm_lookup+0x2a/0x7d0
[ 2298.765403] RSP: 0018:ffffa211c0043ad0 EFLAGS: 00010246
[ 2298.770656] RAX: 0000000000000000 RBX: ffffffff87074080 RCX: 0000000000000000
[ 2298.777851] RDX: ffffa211c0043b48 RSI: 0000000000000000 RDI: ffffffff87074080
[ 2298.785025] RBP: ffffffff87074080 R08: 0000000000000002 R09: 0000000000000000
[ 2298.792184] R10: 0000000000000020 R11: 0000000000000020 R12: ffffa211c0043b48
[ 2298.799351] R13: 0000000000000000 R14: 0000000000000002 R15: ffffa0dafb240078
[ 2298.806511] FS:  0000000000000000(0000) GS:ffffa0daffc00000(0000) knlGS:0000000000000000
[ 2298.814647] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2298.820428] CR2: 0000000000000020 CR3: 0000000177dcc000 CR4: 00000000001006f0
[ 2298.827587] Call Trace:
[ 2298.830045]  __xfrm_route_forward+0xa4/0x110
[ 2298.834340]  ip_forward+0x3da/0x450
[ 2298.837851]  ? ip_rcv_finish+0x61/0x390
[ 2298.841708]  ip_rcv+0x2b5/0x380
[ 2298.844871]  ? inet_del_offload+0x30/0x30
[ 2298.848910]  __netif_receive_skb_core+0x751/0xb00
[ 2298.853640]  ? netif_receive_skb_internal+0x47/0xf0
[ 2298.858573]  ? inet_gro_receive+0x1fa/0x2a0
[ 2298.862785]  netif_receive_skb_internal+0x47/0xf0
[ 2298.867523]  dev_gro_receive+0x270/0x440
[ 2298.871487]  napi_gro_receive+0x28/0x90
[ 2298.875350]  igb_poll+0x600/0xe80
[ 2298.878695]  net_rx_action+0x1fc/0x310
[ 2298.882478]  __do_softirq+0xd5/0x1cf
[ 2298.886064]  run_ksoftirqd+0x14/0x30
[ 2298.889670]  smpboot_thread_fn+0xf9/0x150
[ 2298.893707]  kthread+0xf2/0x130
[ 2298.896869]  ? sort_range+0x20/0x20
[ 2298.900387]  ? kthread_park+0x60/0x60
[ 2298.904080]  ret_from_fork+0x1f/0x30
[ 2298.907684] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84 
[ 2298.926666] RIP: xfrm_lookup+0x2a/0x7d0 RSP: ffffa211c0043ad0
[ 2298.932447] CR2: 0000000000000020
[ 2298.935792] ---[ end trace 4045e2796d0dd0c8 ]---
[ 2298.940446] Kernel panic - not syncing: Fatal exception in interrupt
[ 2298.946937] Kernel Offset: 0x5000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2298.957711] Rebooting in 10 seconds..
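
For reference, the faulting access can be cross-checked from the oops alone. The following is only a minimal sketch: the helper names are hypothetical, the log excerpt is copied from the trace above, and the decoder handles just the one instruction form at the marked byte (a 64-bit mov load with a base register and an 8-bit displacement). It checks that the base register plus the displacement equals the address reported in CR2.

```python
import re

# Excerpt copied from the oops above (timestamps dropped).
OOPS = """\
RAX: 0000000000000000 RBX: ffffffff87074080 RCX: 0000000000000000
RDX: ffffa211c0043b48 RSI: 0000000000000000 RDI: ffffffff87074080
CR2: 0000000000000020 CR3: 0000000177dcc000 CR4: 00000000001006f0
Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
"""

# x86-64 register numbering for ModRM fields when no REX.R/REX.B bit is set.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def parse_registers(text):
    """Collect 'NAME: 16-hex-digit value' pairs into a dict of integers."""
    pairs = re.findall(r"\b([A-Z0-9]{3}): ([0-9a-f]{16})\b", text)
    return {name.lower(): int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the code bytes starting at the byte the kernel marked with <..>."""
    code = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return [int(b.strip("<>"), 16) for b in code[start:]]


def decode_mov_load_disp8(insn):
    """Decode only 'REX.W 8b /r' with mod=01, i.e. mov disp8(%base),%dest."""
    rex, opcode, modrm, disp = insn[:4]
    assert rex == 0x48 and opcode == 0x8B, "not a plain 64-bit mov load"
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b01 and rm != 0b100, "only [base + disp8] without SIB handled"
    if disp >= 0x80:  # disp8 is a signed byte
        disp -= 0x100
    return REG64[reg], REG64[rm], disp


regs = parse_registers(OOPS)
dest, base, disp = decode_mov_load_disp8(faulting_bytes(OOPS))
address = regs[base] + disp

print(f"mov 0x{disp:x}(%{base}),%{dest}")
print(f"{base}=0x{regs[base]:x} -> address 0x{address:x}, CR2=0x{regs['cr2']:x}")
```

If RSI still holds the second argument of xfrm_lookup() at that point (an assumption based on the x86-64 calling convention, not verified against a disassembly), this would mean the function was entered with a NULL pointer there and faulted reading a field at offset 0x20.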
