On 11/20/2017 08:36 AM, Alexander Duyck wrote:
> Hi Sarah,
>
> I am adding the netdev mailing list as I am not certain this is an
> i350 specific issue. The traces themselves aren't anything I recognize
> as an existing issue. From what I can tell it looks like you are
> running Xen, so would I be correct in assuming you are bridging
> between VMs? If so are you using any sort of tunnels on your network,
> if so what type? This information would be useful as we may be looking
> at a bug in a tunnel offload for GRO.

Yes, there's bridging. The traffic on the physical device is tagged with
VLANs and the bridges use untagged traffic. There are no tunnels. I do not
own the VMs' traffic.

Because I have only seen this on a single server with unique hardware, I
think it's most likely related to the hardware or to a particular VM on
that server.

>
> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman <sarah.new...@computer.org>
> wrote:
>> Hi,
>>
>> I have an X10 supermicro with two I350's that has crashed twice now under
>> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
>
> What was the last kernel you tested before v4.9.39? Just wondering as
> it will help to rule out certain patches as possibly being the issue.

4.9.31. If the problem is related to a particular VM, then I don't think
the last known good kernel is necessarily pertinent, as the problematic
traffic could have started at any time.

>> I see in the release notes
>> https://downloadmirror.intel.com/22919/eng/README.txt " Do Not Use LRO When
>> Routing Packets."
>>
>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>
>> Is it possible there are problems with GRO for bridging in the igb driver
>> now? If I disable GRO can I have some confidence it will fix the issue?
>
> As far as LRO not being used when routing, just so you know LRO and
> GRO are two very different things. One of the issues with LRO is that
> it wasn't reversible in some cases and so could lead to the packet
> being changed if they were rerouted. With GRO that shouldn't be the
> case as we should be able to get back out the original packets that
> were put into a frame. So there shouldn't be any issues using GRO with
> bridging or routing.

In some very old release notes for the ixgbe driver
(https://downloadmirror.intel.com/22919/eng/README.txt) it said to disable
GRO for bridging/routing, and it wasn't clear to me whether that was
specific to the driver.
I didn't originally notice how old the release notes were or that the
notice was removed in newer versions; I apologize.

>> First crash:
>>
>> [4083386.299221] ------------[ cut here ]------------
>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473
>> inet_gro_complete+0xbb/0xd0
>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>> ip6table_filter
>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc
>> xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2
>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456
>> async_raid6_recov async_pq
>> async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev shpchp
>> i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core
>> mlx4_core mpt3sas
>> scsi_transport_sas raid_class wmi ast ttm
>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS
>> 2.0a 09/16/2016
>> [4083386.301109] ffff880306603d90 ffffffff813f5935 0000000000000000
>> 0000000000000000
>> [4083386.301221] ffff880306603dd0 ffffffff810a7e01 000005c18174578a
>> ffff8802f94a9a00
>> [4083386.301333] ffff8802f0824450 0000000000000000 0000000000000040
>> 0000000000000040
>> [4083386.301445] Call Trace:
>> [4083386.301483] <IRQ> [4083386.301519] dump_stack+0x63/0x8e
>> [4083386.301596] __warn+0xd1/0xf0
>> [4083386.301665] warn_slowpath_null+0x1d/0x20
>> [4083386.301747] inet_gro_complete+0xbb/0xd0
>> [4083386.301830] napi_gro_complete+0x73/0xa0
>> [4083386.301911] napi_gro_flush+0x5f/0x80
>> [4083386.301988] napi_complete_done+0x6a/0xb0
>> [4083386.302075] igb_poll+0x38d/0x720 [igb]
>> [4083386.302156] ? igb_msix_ring+0x2e/0x40 [igb]
>> [4083386.302255] ? __handle_irq_event_percpu+0x4b/0x1a0
>> [4083386.302349] net_rx_action+0x158/0x360
>> [4083386.302430] __do_softirq+0xd1/0x283
>> [4083386.302507] irq_exit+0xe9/0x100
>> [4083386.302580] xen_evtchn_do_upcall+0x35/0x50
>> [4083386.302665] xen_do_hypervisor_callback+0x1e/0x40
>> [4083386.302754] <EOI> [4083386.302787] ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302876] ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.302965] ? xen_safe_halt+0x10/0x20
>> [4083386.303043] ? default_idle+0x1e/0xd0
>> [4083386.303122] ? arch_cpu_idle+0xf/0x20
>> [4083386.303200] ? default_idle_call+0x2c/0x40
>> [4083386.303284] ? cpu_startup_entry+0x1ac/0x240
>> [4083386.303370] ? rest_init+0x77/0x80
>> [4083386.303462] ? start_kernel+0x4a7/0x4b4
>> [4083386.303568] ? set_init_arg+0x55/0x55
>> [4083386.303670] ? x86_64_start_reservations+0x24/0x26
>> [4083386.303776] ? xen_start_kernel+0x555/0x561
>> [4083386.303873] ---[ end trace 8294f59ced689507 ]---
>> [4083386.303958] general protection fault: 0000 [#1] SMP
>> [4083386.304041] Modules linked in: sb_edac edac_core 8021q mrp garp
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>> ip6table_filter
>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs
>> xen_privcmd xen_evtchn
>> xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip ebt_arp
>> ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc
>> iTCO_wdt
>> iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq async_xor xor
>> async_memcpy
>> async_tx raid10 raid6_pq libcrc32c joydev shpchp i2c_i801 i2c_smbus mei_me
>> mei lpc_ich fjes ipmi_si ipmi_msghandler acpi_power_meter ioatdma igb dca
>> raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas
>> scsi_transport_sas raid_class
>> wmi ast ttm
>> [4083386.305179] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W
>> 4.9.39 #1
>> [4083386.305307] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS
>> 2.0a 09/16/2016
>> [4083386.305414] task: ffffffff81e0e540 task.stack: ffffffff81e00000
>> [4083386.305498] RIP: e030: skb_release_data+0x73/0xf0
>> [4083386.305617] RSP: e02b:ffff880306603d90 EFLAGS: 00010206
>> [4083386.305692] RAX: 0000000000000030 RBX: f5b36db76bd162c7 RCX:
>> ffffffff81e60048
>> [4083386.305790] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
>> ffff8802f94a9a00
>> [4083386.305887] RBP: ffff880306603db0 R08: 0000000000004277 R09:
>> 0000000000000000
>> [4083386.305985] R10: 0000000000000005 R11: 0000000000000002 R12:
>> 0000000000000000
>> [4083386.306083] R13: ffff8802f94a9a00 R14: ffff88032f527740 R15:
>> 0000000000000040
>> [4083386.306186] FS: 0000000000000000(0000) GS:ffff880306600000(0000)
>> knlGS:0000000000000000
>> [4083386.306296] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [4083386.306407] CR2: 0000000001692ed8 CR3: 000000022b3c9000 CR4:
>> 0000000000042660
>> [4083386.306505] Stack:
>> [4083386.306537] ffff8802f94a9a00 ffff8802f94a9a00 ffffffff8175ac3e
>> 0000000000000040
>> [4083386.306649] ffff880306603dc8 ffffffff81745764 ffff8802f94a9a00
>> ffff880306603df0
>> [4083386.306762] ffffffff817457c2 ffff8802f94a9a00 ffff8802f0824450
>> 0000000000000000
>> [4083386.306874] Call Trace:
>> [4083386.306911] <IRQ> [4083386.306944] ? napi_gro_complete+0x5e/0xa0
>> [4083386.307038] skb_release_all+0x24/0x30
>> [4083386.307133] kfree_skb+0x32/0x90
>> [4083386.307206] napi_gro_complete+0x5e/0xa0
>> [4083386.307287] napi_gro_flush+0x5f/0x80
>> [4083386.307365] napi_complete_done+0x6a/0xb0
>> [4083386.307449] igb_poll+0x38d/0x720 [igb]
>> [4083386.307530] ? igb_msix_ring+0x2e/0x40 [igb]
>> [4083386.307617] ? __handle_irq_event_percpu+0x4b/0x1a0
>> [4083386.307720] net_rx_action+0x158/0x360
>> [4083386.307800] __do_softirq+0xd1/0x283
>> [4083386.307877] irq_exit+0xe9/0x100
>> [4083386.307949] xen_evtchn_do_upcall+0x35/0x50
>> [4083386.308034] xen_do_hypervisor_callback+0x1e/0x40
>> [4083386.308124] <EOI> [4083386.308156] ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.308246] ? xen_hypercall_sched_op+0xa/0x20
>> [4083386.308334] ? xen_safe_halt+0x10/0x20
>> [4083386.308413] ? default_idle+0x1e/0xd0
>> [4083386.308491] ? arch_cpu_idle+0xf/0x20
>> [4083386.308568] ? default_idle_call+0x2c/0x40
>> [4083386.308651] ? cpu_startup_entry+0x1ac/0x240
>> [4083386.308737] ? rest_init+0x77/0x80
>> [4083386.308811] ? start_kernel+0x4a7/0x4b4
>> [4083386.308890] ? set_init_arg+0x55/0x55
>> [4083386.308968] ? x86_64_start_reservations+0x24/0x26
>> [4083386.309060] ? xen_start_kernel+0x555/0x561
>> [4083386.309144] Code: f0 41 0f c1 46 20 39 c2 74 09 5b 41 5c 41 5d 41 5e 5d
>> c3 45 31 e4 41 80 3e 00 74 39 49 63 c4 48 83 c0 03 48 c1 e0 04 49 8b 1c
>> 06 <48> 8b 43 20 a8 01 75 6f f0 ff 4b 1c 74 55 48 8b 03 48 c1 e8 33
>> [4083386.309571] RIP skb_release_data+0x73/0xf0
>> [4083386.309658] RSP <ffff880306603d90>
>> [4083386.313000] ---[ end trace 8294f59ced689508 ]---
>> [4083386.389667] Kernel panic - not syncing: Fatal exception in interrupt
>> [4083386.389791] Kernel Offset: disabled
>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

Output of addr2line for the address of skb_release_data+0x73 is:

__read_once_size include/linux/compiler.h:243 (discriminator 2)
compound_head include/linux/page-flags.h:143 (discriminator 2)
put_page include/linux/mm.h:777 (discriminator 2)
__skb_frag_unref include/linux/skbuff.h:2592 (discriminator 2)
skb_release_data net/core/skbuff.c:594 (discriminator 2)

skbuff.c:594 is:

__skb_frag_unref(&shinfo->frags[i]);

Actual assembly is:

<+91>: xor %r12d,%r12d
<+94>: cmpb $0x0,(%r14)
<+98>: je <skb_release_data+157>
<+100>: movslq %r12d,%rax
<+103>: add $0x3,%rax
<+107>: shl $0x4,%rax
<+111>: mov (%r14,%rax,1),%rbx
<+115>: mov 0x20(%rbx),%rax <------ this is skb_release_data+0x73
<+119>: test $0x1,%al
<+121>: jne <skb_release_data+234>

rbx is f5b36db76bd162c7, which seems like garbage. I don't know whether it
matches any recognizable pattern.

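
As a sanity check on that reading, here is a small Python sketch (not part of the original analysis) that redoes the arithmetic from the disassembly and register dump. The names `is_canonical`, `frag_slot_offset` and `VA_BITS` are made up for illustration, and the sketch assumes 48-bit virtual addresses (4-level paging). With r12 = 0 the slot offset works out to 0x30, which matches the RAX value in the dump, and rbx fails the canonical-address test. That would be consistent with the load at +115 raising a general protection fault rather than a page fault, since x86-64 reports non-canonical accesses that way.

```python
# Values copied from the first crash's register dump above.
RBX = 0xF5B36DB76BD162C7   # pointer loaded at <+111>, dereferenced at <+115>
R12 = 0x0                  # loop index i in the disassembly
R14 = 0xFFFF88032F527740   # base register used for the load at <+111>

VA_BITS = 48  # assumption: 4-level paging, so bits 63..47 must all be equal


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """Return True if addr is a canonical x86-64 virtual address."""
    top = addr >> (va_bits - 1)               # bit 47 and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits for va_bits == 48
    return top == 0 or top == all_ones


def frag_slot_offset(i: int) -> int:
    """Mirror 'add $0x3,%rax; shl $0x4,%rax' from the disassembly."""
    return (i + 3) << 4


offset = frag_slot_offset(R12)
print(f"slot offset for i={R12}: {offset:#x}")
print(f"slot address (r14 + offset): {R14 + offset:#x}")
print(f"rbx canonical: {is_canonical(RBX)}")
print(f"r14 canonical: {is_canonical(R14)}")
```

This only confirms that the pointer read from the first frag slot was invalid; it says nothing about what overwrote it.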
>> Second crash:
>>
>> [1838269.012349] general protection fault: 0000 [#1] SMP
>> [1838269.012452] Modules linked in: ebtable_nat sb_edac edac_core 8021q mrp
>> garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>> ip6table_filter ip6_tables xen_pciback blktap xen_netback xen_gntdev
>> xen_gntalloc xenfs xen_privcmd
>> xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip
>> ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp
>> llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq
>> async_xor xor
>> async_memcpy async_tx raid10 raid6_pq libcrc32c joydev i2c_i801 i2c_smbus
>> lpc_ich shpchp mei_me mei fjes ipmi_si ipmi_msghandler acpi_power_meter
>> ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas
>> scsi_transport_sas
>> raid_class wmi ast ttm
>> [1838269.013521] CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.9.39 #1
>> [1838269.013637] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS
>> 2.0a 09/16/2016
>> [1838269.013743] task: ffff88030008c4c0 task.stack: ffffc90041978000
>> [1838269.013826] RIP: e030: memcpy_erms+0x6/0x10
>> [1838269.013952] RSP: e02b:ffffc9004197bac0 EFLAGS: 00010202
>> [1838269.014026] RAX: ffff88032fcafe16 RBX: 0000000000000004 RCX:
>> 0000000000000004
>> [1838269.014124] RDX: 0000000000000004 RSI: 62a16ddedc6dbcb3 RDI:
>> ffff88032fcafe16
>> [1838269.014222] RBP: ffffc9004197bb20 R08: 0000000000000004 R09:
>> 0000000000000004
>> [1838269.014320] R10: ffff88026ae89500 R11: 0000000044639632 R12:
>> 0000000000000048
>> [1838269.014417] R13: 0000000000000000 R14: 0000000044639632 R15:
>> 0000000000000048
>> [1838269.014519] FS: 0000000000000000(0000) GS:ffff880306640000(0000)
>> knlGS:ffff880306640000
>> [1838269.014629] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [1838269.014709] CR2: ffffffffff600400 CR3: 0000000051939000 CR4:
>> 0000000000042660
>> [1838269.014808] Stack:
>> [1838269.014840] ffffffff81744c17 ffff88026ae89500 0000000044639632
>> ffff88030008c4c0
>> [1838269.014952] ffffffff00000004 0000000000000004 ffff88032fcafe16
>> ffff88026ae89500
>> [1838269.015064] 0000000000000004 0000000000000004 000000000000004c
>> 0000000000000028
>> [1838269.015176] Call Trace:
>> [1838269.015217] ? skb_copy_bits+0x137/0x2c0
>> [1838269.015299] __pskb_pull_tail+0x7f/0x3b0
>> [1838269.015382] tcp_gro_receive+0x2c5/0x300
>> [1838269.015465] tcp6_gro_receive+0x13a/0x1a0
>> [1838269.015547] ipv6_gro_receive+0x1c6/0x380
>> [1838269.015630] dev_gro_receive+0x269/0x3b0
>> [1838269.015712] napi_gro_receive+0x38/0xf0
>> [1838269.015796] igb_clean_rx_irq+0x38e/0x690 [igb]
>> [1838269.015886] igb_poll+0x362/0x720 [igb]
>> [1838269.015968] ? dequeue_entity+0x26e/0xa90
>> [1838269.016051] ? xen_mc_flush+0x17b/0x1b0
>> [1838269.016131] net_rx_action+0x158/0x360
>> [1838269.016212] __do_softirq+0xd1/0x283
>> [1838269.016290] ? sort_range+0x30/0x30
>> [1838269.016366] run_ksoftirqd+0x29/0x50
>> [1838269.016443] smpboot_thread_fn+0x110/0x160
>> [1838269.016525] kthread+0xd7/0xf0
>> [1838269.016595] ? kthread_park+0x60/0x60
>> [1838269.016673] ret_from_fork+0x25/0x30
>> [1838269.016758] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1
>> e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89
>> d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>> [1838269.017183] RIP memcpy_erms+0x6/0x10
>> [1838269.017264] RSP <ffffc9004197bac0>
>> [1838269.020618] ---[ end trace 3506ce1d7200529a ]---
>> [1838269.079891] Kernel panic - not syncing: Fatal exception in interrupt
>> [1838269.080014] Kernel Offset: disabled
>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.

--Sarah

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired