On 11/20/2017 08:36 AM, Alexander Duyck wrote:
> Hi Sarah,
> 
> I am adding the netdev mailing list as I am not certain this is an
> i350 specific issue. The traces themselves aren't anything I recognize
> as an existing issue. From what I can tell it looks like you are
> running Xen, so would I be correct in assuming you are bridging
> between VMs? If so are you using any sort of tunnels on your network,
> if so what type? This information would be useful as we may be looking
> at a bug in a tunnel offload for GRO.

Yes, there's bridging. The traffic on the physical device is tagged with VLANs
and the bridges carry untagged traffic. There are no tunnels. I do not
own the VMs' traffic.

Because I have only seen this on a single server with unique hardware, I think 
it's most likely related to the hardware or to a particular VM on that
server.

> 
> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman <sarah.new...@computer.org> 
> wrote:
>> Hi,
>>
>> I have an X10 supermicro with two I350's that has crashed twice now under 
>> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
> 
> What was the last kernel you tested before v4.9.39? Just wondering as
> it will help to rule out certain patches as possibly being the issue.

4.9.31.

If the problem is related to a particular VM, then I don't think the last known 
good kernel is necessarily pertinent, as the problematic traffic could
have started at any time.

>> I see in the release notes 
>> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When 
>> Routing Packets."
>>
>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>
>> Is it possible there are problems with GRO for bridging in the igb driver 
>> now? If I disable GRO can I have some confidence it will fix the issue?
> 
> As far as LRO not being used when routing, just so you know LRO and
> GRO are two very different things. One of the issues with LRO is that
> it wasn't reversible in some cases and so could lead to the packet
> being changed if they were rerouted. With GRO that shouldn't be the
> case as we should be able to get back out the original packets that
> were put into a frame. So there shouldn't be any issues using GRO with
> bridging or routing.

Some very old release notes for the ixgbe driver
(https://downloadmirror.intel.com/22919/eng/README.txt) said to disable GRO
for bridging/routing, and it wasn't clear to me whether that advice was
specific to that driver. I didn't originally notice how old the release notes
were, or that the notice was removed in newer versions. I apologize.
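
For anyone who wants to test with GRO turned off on the affected port anyway,
the usual tool is `ethtool -K <dev> gro off`. A minimal wrapper sketch follows.
The interface name eth0 is a placeholder and the function names are made up for
illustration; dry-run is the default so nothing changes unless asked.

```python
import subprocess


def gro_off_command(iface):
    """Build the ethtool invocation that turns GRO off on one interface."""
    return ["ethtool", "-K", iface, "gro", "off"]


def disable_gro(iface, dry_run=True):
    """Return the command; only execute it when dry_run is False (needs root)."""
    cmd = gro_off_command(iface)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "eth0" is a placeholder, not the name of the port on the affected server.
    print(" ".join(disable_gro("eth0")))
```

This would only be a diagnostic step: if the crashes stop with GRO off, that
narrows things down, but it would not by itself show where the corruption
comes from.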

>> First crash:
>>
>> [4083386.299221] ------------[ cut here ]------------
>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473 
>> inet_gro_complete+0xbb/0xd0
>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp 
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev 
>> ip6table_filter
>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc
>> xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2
>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456 
>> async_raid6_recov async_pq
>>  async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp 
>> i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core 
>> mlx4_core mpt3sas
>>  scsi_transport_sas raid_class wmi ast ttm
>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 
>> 2.0a 09/16/2016
>> [4083386.301109]  ffff880306603d90 ffffffff813f5935 0000000000000000 
>> 0000000000000000
>> [4083386.301221]  ffff880306603dd0 ffffffff810a7e01 000005c18174578a 
>> ffff8802f94a9a00
>> [4083386.301333]  ffff8802f0824450 0000000000000000 0000000000000040 
>> 0000000000000040
>> [4083386.301445] Call Trace:
>> [4083386.301483]  <IRQ> [4083386.301519]   dump_stack+0x63/0x8e
>> [4083386.301596]   __warn+0xd1/0xf0
>> [4083386.301665]   warn_slowpath_null+0x1d/0x20
>> [4083386.301747]   inet_gro_complete+0xbb/0xd0
>> [4083386.301830]   napi_gro_complete+0x73/0xa0
>> [4083386.301911]   napi_gro_flush+0x5f/0x80
>> [4083386.301988]   napi_complete_done+0x6a/0xb0
>> [4083386.302075]   igb_poll+0x38d/0x720 [igb]
>> [4083386.302156]   ? igb_msix_ring+0x2e/0x40 [igb]
>> [4083386.302255]   ? __handle_irq_event_percpu+0x4b/0x1a0
>> [4083386.302349]   net_rx_action+0x158/0x360
>> [4083386.302430]   __do_softirq+0xd1/0x283
>> [4083386.302507]   irq_exit+0xe9/0x100
>> [4083386.302580]   xen_evtchn_do_upcall+0x35/0x50
>> [4083386.302665]   xen_do_hypervisor_callback+0x1e/0x40
>> [4083386.302754]  <EOI> [4083386.302787]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302876]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302965]   ? xen_safe_halt+0x10/0x20
>> [4083386.303043]   ? default_idle+0x1e/0xd0
>> [4083386.303122]   ? arch_cpu_idle+0xf/0x20
>> [4083386.303200]   ? default_idle_call+0x2c/0x40
>> [4083386.303284]   ? cpu_startup_entry+0x1ac/0x240
>> [4083386.303370]   ? rest_init+0x77/0x80
>> [4083386.303462]   ? start_kernel+0x4a7/0x4b4
>> [4083386.303568]   ? set_init_arg+0x55/0x55
>> [4083386.303670]   ? x86_64_start_reservations+0x24/0x26
>> [4083386.303776]   ? xen_start_kernel+0x555/0x561
>> [4083386.303873] ---[ end trace 8294f59ced689507 ]---
>> [4083386.303958] general protection fault: 0000 [#1] SMP
>> [4083386.304041] Modules linked in: sb_edac edac_core 8021q mrp garp 
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev 
>> ip6table_filter
>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs 
>> xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip ebt_arp
>> ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc 
>> iTCO_wdt
>> iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq async_xor xor 
>> async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp i2c_i801 i2c_smbus mei_me
>> mei lpc_ich fjes ipmi_si ipmi_msghandler acpi_power_meter ioatdma igb dca
>> raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas 
>> scsi_transport_sas raid_class wmi ast ttm
>> [4083386.305179] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       
>> 4.9.39 #1
>> [4083386.305307] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 
>> 2.0a 09/16/2016
>> [4083386.305414] task: ffffffff81e0e540 task.stack: ffffffff81e00000
>> [4083386.305498] RIP: e030:   skb_release_data+0x73/0xf0
>> [4083386.305617] RSP: e02b:ffff880306603d90  EFLAGS: 00010206
>> [4083386.305692] RAX: 0000000000000030 RBX: f5b36db76bd162c7 RCX: 
>> ffffffff81e60048
>> [4083386.305790] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 
>> ffff8802f94a9a00
>> [4083386.305887] RBP: ffff880306603db0 R08: 0000000000004277 R09: 
>> 0000000000000000
>> [4083386.305985] R10: 0000000000000005 R11: 0000000000000002 R12: 
>> 0000000000000000
>> [4083386.306083] R13: ffff8802f94a9a00 R14: ffff88032f527740 R15: 
>> 0000000000000040
>> [4083386.306186] FS:  0000000000000000(0000) GS:ffff880306600000(0000) 
>> knlGS:0000000000000000
>> [4083386.306296] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [4083386.306407] CR2: 0000000001692ed8 CR3: 000000022b3c9000 CR4: 
>> 0000000000042660
>> [4083386.306505] Stack:
>> [4083386.306537]  ffff8802f94a9a00 ffff8802f94a9a00 ffffffff8175ac3e 
>> 0000000000000040
>> [4083386.306649]  ffff880306603dc8 ffffffff81745764 ffff8802f94a9a00 
>> ffff880306603df0
>> [4083386.306762]  ffffffff817457c2 ffff8802f94a9a00 ffff8802f0824450 
>> 0000000000000000
>> [4083386.306874] Call Trace:
>> [4083386.306911]  <IRQ> [4083386.306944]   ? napi_gro_complete+0x5e/0xa0
>> [4083386.307038]   skb_release_all+0x24/0x30
>> [4083386.307133]   kfree_skb+0x32/0x90
>> [4083386.307206]   napi_gro_complete+0x5e/0xa0
>> [4083386.307287]   napi_gro_flush+0x5f/0x80
>> [4083386.307365]   napi_complete_done+0x6a/0xb0
>> [4083386.307449]   igb_poll+0x38d/0x720 [igb]
>> [4083386.307530]   ? igb_msix_ring+0x2e/0x40 [igb]
>> [4083386.307617]   ? __handle_irq_event_percpu+0x4b/0x1a0
>> [4083386.307720]   net_rx_action+0x158/0x360
>> [4083386.307800]   __do_softirq+0xd1/0x283
>> [4083386.307877]   irq_exit+0xe9/0x100
>> [4083386.307949]   xen_evtchn_do_upcall+0x35/0x50
>> [4083386.308034]   xen_do_hypervisor_callback+0x1e/0x40
>> [4083386.308124]  <EOI> [4083386.308156]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.308246]   ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.308334]   ? xen_safe_halt+0x10/0x20
>> [4083386.308413]   ? default_idle+0x1e/0xd0
>> [4083386.308491]   ? arch_cpu_idle+0xf/0x20
>> [4083386.308568]   ? default_idle_call+0x2c/0x40
>> [4083386.308651]   ? cpu_startup_entry+0x1ac/0x240
>> [4083386.308737]   ? rest_init+0x77/0x80
>> [4083386.308811]   ? start_kernel+0x4a7/0x4b4
>> [4083386.308890]   ? set_init_arg+0x55/0x55
>> [4083386.308968]   ? x86_64_start_reservations+0x24/0x26
>> [4083386.309060]   ? xen_start_kernel+0x555/0x561
>> [4083386.309144] Code: f0 41 0f c1 46 20 39 c2 74 09 5b 41 5c 41 5d 41 5e 5d 
>> c3 45 31 e4 41 80 3e 00 74 39 49 63 c4 48 83 c0 03 48 c1 e0 04 49 8b 1c
>> 06 <48> 8b 43 20 a8 01 75 6f f0 ff 4b 1c 74 55 48 8b 03 48 c1 e8 33
>> [4083386.309571] RIP   skb_release_data+0x73/0xf0
>> [4083386.309658]  RSP <ffff880306603d90>
>> [4083386.313000] ---[ end trace 8294f59ced689508 ]---
>> [4083386.389667] Kernel panic - not syncing: Fatal exception in interrupt
>> [4083386.389791] Kernel Offset: disabled
>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

The addr2line output for skb_release_data+0x73 is:

__read_once_size
include/linux/compiler.h:243 (discriminator 2)
compound_head
include/linux/page-flags.h:143 (discriminator 2)
put_page
include/linux/mm.h:777 (discriminator 2)
__skb_frag_unref
include/linux/skbuff.h:2592 (discriminator 2)
skb_release_data
net/core/skbuff.c:594 (discriminator 2)

skbuff.c:594 is:

__skb_frag_unref(&shinfo->frags[i]);

Actual assembly is:
<+91>:  xor    %r12d,%r12d
<+94>:  cmpb   $0x0,(%r14)
<+98>:  je     <skb_release_data+157>
<+100>: movslq %r12d,%rax
<+103>: add    $0x3,%rax
<+107>: shl    $0x4,%rax
<+111>: mov    (%r14,%rax,1),%rbx
<+115>: mov    0x20(%rbx),%rax <------ this is skb_release_data+0x73
<+119>: test   $0x1,%al
<+121>: jne   <skb_release_data+234>
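
As a cross-check on the disassembly, the address arithmetic at <+100> through
<+111> can be reproduced in a few lines. This is only a sketch: the 0x30 offset
and the 16-byte stride are inferred from the add/shl pair in this particular
build, not taken from the headers, and the function name is made up for
illustration.

```python
# Reproduce the index arithmetic from <+100>..<+111>:
#   rax = (r12 + 3) << 4 ; rbx = *(r14 + rax)
# Assumption: r14 holds the skb_shared_info pointer and r12 is the loop index i.

def frag_slot_offset(i):
    """Byte offset from r14 of the 8-byte word loaded into rbx for index i."""
    return (i + 3) << 4


if __name__ == "__main__":
    for i in range(3):
        print(f"i={i}: load from r14 + {frag_slot_offset(i):#x}")
```

With i = 0 this gives 0x30, which matches RAX (0x30) and R12 (0) in the
register dump, so the fault appears to be on the very first fragment.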

RBX is f5b36db76bd162c7, which looks like garbage. I don't know whether it
matches any recognizable pattern.
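
One thing that can be said about the value: it is not a canonical x86-64
address, which would explain why the access raised a general protection fault
rather than a page fault. A minimal check, assuming 48-bit virtual addresses
(4-level paging); the helper name is illustrative:

```python
def is_canonical_48(addr):
    """True if bits 63..47 of a 64-bit address are all zeros or all ones."""
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> 47  # the upper 17 bits
    return top == 0 or top == 0x1FFFF


if __name__ == "__main__":
    # Register values copied from the two oops reports.
    samples = {
        "RBX, first crash": 0xF5B36DB76BD162C7,
        "RDI, first crash": 0xFFFF8802F94A9A00,
        "RSI, second crash": 0x62A16DDEDC6DBCB3,
    }
    for name, value in samples.items():
        print(f"{name}: {value:#018x} canonical={is_canonical_48(value)}")
```

RBX from the first crash and RSI (the memcpy source) from the second both fail
the check, while an ordinary pointer such as RDI passes. That is consistent
with a pointer having been overwritten in both cases, though it says nothing
about what did the overwriting.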

>> Second crash:
>>
>> [1838269.012349] general protection fault: 0000 [#1] SMP
>> [1838269.012452] Modules linked in: ebtable_nat sb_edac edac_core 8021q mrp 
>> garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>> ip6table_filter ip6_tables xen_pciback blktap xen_netback xen_gntdev 
>> xen_gntalloc xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2
>> ebt_mark ebt_ip
>> ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp
>> llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq 
>> async_xor xor
>>  async_memcpy async_tx raid10 raid6_pq libcrc32c joydev i2c_i801 i2c_smbus 
>> lpc_ich shpchp mei_me mei fjes ipmi_si ipmi_msghandler acpi_power_meter
>> ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas 
>> scsi_transport_sas raid_class wmi ast ttm
>> [1838269.013521] CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.9.39 #1
>> [1838269.013637] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 
>> 2.0a 09/16/2016
>> [1838269.013743] task: ffff88030008c4c0 task.stack: ffffc90041978000
>> [1838269.013826] RIP: e030:   memcpy_erms+0x6/0x10
>> [1838269.013952] RSP: e02b:ffffc9004197bac0  EFLAGS: 00010202
>> [1838269.014026] RAX: ffff88032fcafe16 RBX: 0000000000000004 RCX: 
>> 0000000000000004
>> [1838269.014124] RDX: 0000000000000004 RSI: 62a16ddedc6dbcb3 RDI: 
>> ffff88032fcafe16
>> [1838269.014222] RBP: ffffc9004197bb20 R08: 0000000000000004 R09: 
>> 0000000000000004
>> [1838269.014320] R10: ffff88026ae89500 R11: 0000000044639632 R12: 
>> 0000000000000048
>> [1838269.014417] R13: 0000000000000000 R14: 0000000044639632 R15: 
>> 0000000000000048
>> [1838269.014519] FS:  0000000000000000(0000) GS:ffff880306640000(0000) 
>> knlGS:ffff880306640000
>> [1838269.014629] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [1838269.014709] CR2: ffffffffff600400 CR3: 0000000051939000 CR4: 
>> 0000000000042660
>> [1838269.014808] Stack:
>> [1838269.014840]  ffffffff81744c17 ffff88026ae89500 0000000044639632 
>> ffff88030008c4c0
>> [1838269.014952]  ffffffff00000004 0000000000000004 ffff88032fcafe16 
>> ffff88026ae89500
>> [1838269.015064]  0000000000000004 0000000000000004 000000000000004c 
>> 0000000000000028
>> [1838269.015176] Call Trace:
>> [1838269.015217]   ? skb_copy_bits+0x137/0x2c0
>> [1838269.015299]   __pskb_pull_tail+0x7f/0x3b0
>> [1838269.015382]   tcp_gro_receive+0x2c5/0x300
>> [1838269.015465]   tcp6_gro_receive+0x13a/0x1a0
>> [1838269.015547]   ipv6_gro_receive+0x1c6/0x380
>> [1838269.015630]   dev_gro_receive+0x269/0x3b0
>> [1838269.015712]   napi_gro_receive+0x38/0xf0
>> [1838269.015796]   igb_clean_rx_irq+0x38e/0x690 [igb]
>> [1838269.015886]   igb_poll+0x362/0x720 [igb]
>> [1838269.015968]   ? dequeue_entity+0x26e/0xa90
>> [1838269.016051]   ? xen_mc_flush+0x17b/0x1b0
>> [1838269.016131]   net_rx_action+0x158/0x360
>> [1838269.016212]   __do_softirq+0xd1/0x283
>> [1838269.016290]   ? sort_range+0x30/0x30
>> [1838269.016366]   run_ksoftirqd+0x29/0x50
>> [1838269.016443]   smpboot_thread_fn+0x110/0x160
>> [1838269.016525]   kthread+0xd7/0xf0
>> [1838269.016595]   ? kthread_park+0x60/0x60
>> [1838269.016673]   ret_from_fork+0x25/0x30
>> [1838269.016758] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 
>> e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89
>> d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>> [1838269.017183] RIP   memcpy_erms+0x6/0x10
>> [1838269.017264]  RSP <ffffc9004197bac0>
>> [1838269.020618] ---[ end trace 3506ce1d7200529a ]---
>> [1838269.079891] Kernel panic - not syncing: Fatal exception in interrupt
>> [1838269.080014] Kernel Offset: disabled
>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

--Sarah
