Re: [E1000-devel] [i40e][bug] driver crashes machine under high network pressure

2021-01-06 Thread Fujinaka, Todd
In case you didn't see it, there appears to be a new version of i40e on 
e1000.sourceforge.net. I haven't tested it yet but I've been told that it 
should fix the issue with crashing during heavy traffic.

Todd Fujinaka
Software Application Engineer
Data Center Group
Intel Corporation
todd.fujin...@intel.com

___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel Ethernet, visit 
https://forums.intel.com/s/topic/0TO0P0018NbWAI/intel-ethernet


Re: [E1000-devel] [i40e][bug] driver crashes machine under high network pressure

2020-12-30 Thread Fujinaka, Todd
The fix is coming and you can ignore that message. I wish I could say more, but 
I don't have any more information at this time.

Todd Fujinaka
Software Application Engineer
Data Center Group
Intel Corporation
todd.fujin...@intel.com


Re: [E1000-devel] [i40e][bug] driver crashes machine under high network pressure

2020-12-30 Thread Marc 'risson' Schmitt
On 12/30/20 5:48 PM, Fujinaka, Todd wrote:
> Can you try the previous driver? I think there are known issues with the latest.

The previous driver (2.16.6) works fine. We are now using the in-tree
version. Both of them give the following warning, which we are ignoring
as the latest driver has issues:

i40e :43:00.0: The driver for the device detected a newer version of
the NVM image v1.11 than expected v1.9. Please install the most recent
version of the network driver.

Regards,

-- 
Marc 'risson' Schmitt
CRI - EPITA


Re: [E1000-devel] [i40e][bug] driver crashes machine under high network pressure

2020-12-30 Thread Fujinaka, Todd
Can you try the previous driver? I think there are known issues with the latest.

Todd Fujinaka
Software Application Engineer
Data Center Group
Intel Corporation
todd.fujin...@intel.com

Re: [E1000-devel] [i40e][bug] driver crashes machine under high network pressure

2020-12-30 Thread Fujinaka, Todd
Unfortunately the kernel crash dump tells us very little besides that you were 
running networking at the time of the dump.

I would suggest that you file a bug here and attach the full dmesg.

Be advised that we generally need to reproduce the issue to make much progress 
and I don't know if we have any AMD systems.

Todd Fujinaka
Software Application Engineer
Data Center Group
Intel Corporation
todd.fujin...@intel.com

Re: [E1000-devel] [i40e][bug] driver crashes machine under high network pressure

2020-12-30 Thread Marc 'risson' Schmitt
Hi,

First, thanks for your swift response!

On 12/30/20 2:22 AM, Fujinaka, Todd wrote:
> First, SourceForge strips attachments, so if you want to submit them you need 
> to open a bug and attach the files there.

I'll attach an extract of the kernel logs at the end of this email.

> Second, if the hardware is Dell, you need to submit the issue to Dell and 
> they will involve us if they need help. They want to troubleshoot problems 
> with their hardware because they need to track the issues. If it is Dell 
> hardware, don't open the bug here because we'll just have to tell you again 
> to submit the issue to Dell.
> 
> The third comment is that this looks like a possible known issue and with 
> Dell hardware you need to use the Dell-approved firmware and drivers. They 
> customize the hardware and firmware and you can't use the generic versions.

The mentioned X722-DA2 were acquired from Intel directly and installed
in the server by us. The server was indeed acquired from Dell.
The firmware for those NICs was also upgraded by us.

Regards,

-- 
Marc 'risson' Schmitt
CRI - EPITA

kernel: BUG: Bad page state in process swapper/20  pfn:79d345
kernel: page:f9f9de74d140 refcount:-1 mapcount:0
mapping: index:0x0
kernel: flags: 0x57c000()
kernel: raw: 0057c000 dead0100 dead0122

kernel: raw:   

kernel: page dumped because: nonzero _refcount
kernel: Modules linked in: cfg80211 xt_conntrack xt_MASQUERADE
nf_conntrack_netlink xfrm_user xfrm_algo nft_counter xt_addrtype
nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_
dge aufs overlay dell_rbu 8021q garp mrp stp llc bonding nls_iso8859_1
dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif
amd64_edac_mod edac_mce_amd amd_energy joydev input_leds cdc_ether
dcdbas dell_wmi_descriptor wmi_bmof efi_pstore ccp k10temp acpi_ipmi
ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid
sch_fq_codel ip_tables x_tables autofs4 btrfs blake2b_generic raid1
async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul mgag200
ghash_clmulni_intel i2c_algo_bit aesni_intel drm_kms_help
rect glue_helper sysimgblt fb_sys_fops nvme
kernel:  cec ahci rc_core tg3 nvme_core libahci drm i40e(OE) xhci_pci
i2c_piix4 xhci_pci_renesas wmi
kernel: CPU: 20 PID: 0 Comm: swapper/20 Tainted: GB   W  OE
5.8.0-33-generic #36-Ubuntu
kernel: Hardware name: Dell Inc. PowerEdge R6525/0GK70M, BIOS 1.4.8
05/06/2020
kernel: Call Trace:
kernel:  
kernel:  show_stack+0x52/0x58
kernel:  dump_stack+0x70/0x8d
kernel:  bad_page.cold+0x63/0x94
kernel:  check_new_page_bad+0x6d/0x80
kernel:  rmqueue_bulk.constprop.0+0x38f/0x4c0
kernel:  rmqueue_pcplist.constprop.0+0x128/0x150
kernel:  rmqueue+0x3e/0x770
kernel:  get_page_from_freelist+0x197/0x2c0
kernel:  __alloc_pages_nodemask+0x15d/0x300
kernel:  i40e_alloc_rx_buffers+0x14a/0x260 [i40e]
kernel:  i40e_napi_poll+0xda3/0x1720 [i40e]
kernel:  napi_poll+0x96/0x1b0
kernel:  net_rx_action+0xb8/0x1c0
kernel:  __do_softirq+0xd0/0x2a1
kernel:  asm_call_irq_on_stack+0x12/0x20
kernel:  
kernel:  do_softirq_own_stack+0x3d/0x50
kernel:  irq_exit_rcu+0x95/0xd0
kernel:  common_interrupt+0x7c/0x150
kernel:  asm_common_interrupt+0x1e/0x40
kernel: RIP: 0010:native_safe_halt+0xe/0x10
kernel: Code: e5 8b 74 d0 04 8b 3c d0 e8 6f b3 49 ff 5d c3 cc cc cc cc
cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 66 ee 43 00 fb f4
 90 e9 07 00 00 00 0f 00 2d 56 ee 43 00 f4 c3 cc cc 0f
kernel: RSP: 0018:a8d68033fe70 EFLAGS: 0246
kernel: RAX: 94fcd3a0 RBX: 98a39ae5af00 RCX: 98a39f0ad440
kernel: RDX: 04fd7af6 RSI: 0014 RDI: 98a39f09fa80
kernel: RBP: a8d68033fe90 R08: 0066a171bc54 R09: 0202
kernel: R10: 0003222e R11:  R12: 0014
kernel: R13:  R14:  R15: 


Re: [E1000-devel] [i40e][bug] driver crashes machine under high network pressure

2020-12-29 Thread Fujinaka, Todd
Sorry to hear about your problems.

First, SourceForge strips attachments, so if you want to submit them you need to 
open a bug and attach the files there.

Second, if the hardware is Dell, you need to submit the issue to Dell and they 
will involve us if they need help. They want to troubleshoot problems with 
their hardware because they need to track the issues. If it is Dell hardware, 
don't open the bug here because we'll just have to tell you again to submit the 
issue to Dell.

The third comment is that this looks like a possible known issue and with Dell 
hardware you need to use the Dell-approved firmware and drivers. They customize 
the hardware and firmware and you can't use the generic versions.

Todd Fujinaka
Software Application Engineer
Data Center Group
Intel Corporation
todd.fujin...@intel.com

-----Original Message-----
From: Marc 'risson' Schmitt  
Sent: Tuesday, December 29, 2020 4:23 PM
To: e1000-devel@lists.sourceforge.net
Cc: c...@cri.epita.fr
Subject: [E1000-devel] [i40e][bug] driver crashes machine under high network 
pressure

Hi,

During our benchmark of a Ceph cluster, we noticed one of our machines had a 
kernel panic and needed a reboot. After checking the kernel logs (attached), it 
turns out this error came from somewhere inside the i40e driver, which is used 
for our X722-DA2 network cards. After browsing Intel's forum, we found that 
someone had a similar problem, but in Windows with their cards being in a team. 
Ours were in a LACP bond, so we decided to stress test them when not in a bond, 
and were able to reproduce the problem. It also happens when stress testing 
only one of the interfaces, and in 1Gb/s mode (the previous tests were all done 
in 10Gb/s), although it takes longer. The time until the machine freezes or 
panics shrinks as the network load grows.
After further investigation, it seems that the number of softirqs is increasing 
by arbitrary steps (graph attached) to a point where the machine freezes. The 
fact that this bug also happens in 1Gb/s mode is leading us to believe that 
there is a memory leak in the i40e driver.
During our tests, we observed that the slab memory is constantly increasing, 
even though the machine is doing nothing other than network operations.
The problem only happens when receiving traffic. Or at least we haven't been 
able to reproduce it when only sending traffic.
After tracing the memory allocations and deallocations made by the kernel, we 
were able to confirm that the driver leaks memory. However, this leak doesn't 
happen when ntuples are off (disabled with `ethtool --features ens1f1 ntuple 
off`).
We now plan to further analyze the memory operations made by the i40e driver 
and will report back if we find anything. In the meantime, we are opening this 
thread hoping that someone might already know of this issue and have a fix.
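
For anyone wanting to watch the same two symptoms (growing slab memory and a rising softirq count), here is a minimal Python sketch. It is not the tooling used for the measurements above; it only assumes the standard Linux `/proc/meminfo` and `/proc/softirqs` files, and the function names and the sampling interval are made up for illustration.

```python
import re
import time


def slab_kib(meminfo_text):
    """Return the 'Slab:' value (in KiB) from the text of /proc/meminfo."""
    match = re.search(r"^Slab:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Slab: line found")
    return int(match.group(1))


def net_rx_total(softirqs_text):
    """Sum the per-CPU NET_RX counters from the text of /proc/softirqs."""
    for line in softirqs_text.splitlines():
        name, _, counts = line.partition(":")
        if name.strip() == "NET_RX":
            return sum(int(c) for c in counts.split())
    raise ValueError("no NET_RX row found")


def watch(interval_s=10, samples=6):
    """Print slab size and NET_RX total a few times (hypothetical defaults)."""
    for _ in range(samples):
        with open("/proc/meminfo") as f:
            slab = slab_kib(f.read())
        with open("/proc/softirqs") as f:
            rx = net_rx_total(f.read())
        print(f"slab={slab} KiB  NET_RX={rx}")
        time.sleep(interval_s)
```

Calling `watch()` on the receiving node while the load runs, once with ntuple on and once with it off, should make any difference in slab growth visible.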

Some useful information:

We are running the latest version of the i40e driver downloaded from Intel's 
website because we had to upgrade the firmware of our network cards.
This upgrade was necessary because the cards would otherwise randomly 
disconnect and the server had to be restarted or the cables un- and re-plugged 
for the card to work again (shutting the port off and back on on the switch did 
the trick too). We did not investigate this issue any further.

The tests were run as follows: 4 iperf3 servers were listening on one node 
(hereafter called node-2), and two nodes (hereafter called node-1 and
node-3) were each running 2 iperf3 clients. As such, each interface of
node-2 was hit by two iperf3 clients, one from node-1, another from node-3. 
Note that we tried different combinations of this, with the same results, and 
as such we can conclude that the problem isn't due to one network card.
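
As a rough illustration of that layout, the Python sketch below builds the iperf3 command lines for one server per (interface, client node) pair. The addresses, port numbers and test duration are placeholders, not the values from our setup; only the standard iperf3 options `-s`, `-c`, `-B`, `-p` and `-t` are used.

```python
def iperf3_commands(server_ifaces, client_nodes, base_port=5201, duration_s=600):
    """Build iperf3 command lines for the test topology.

    server_ifaces: addresses of the interfaces on the node under test.
    client_nodes:  names of the nodes that generate traffic.
    base_port and duration_s are hypothetical defaults.
    """
    servers = []
    clients = {node: [] for node in client_nodes}
    port = base_port
    for addr in server_ifaces:
        for node in client_nodes:
            # One listening server per (interface, client node) pair.
            servers.append(["iperf3", "-s", "-B", addr, "-p", str(port)])
            # The matching client, to be started on `node`.
            clients[node].append(
                ["iperf3", "-c", addr, "-p", str(port), "-t", str(duration_s)]
            )
            port += 1
    return servers, clients


# Placeholder addresses for the two interfaces of node-2.
servers, clients = iperf3_commands(["192.0.2.1", "192.0.2.2"], ["node-1", "node-3"])
for cmd in servers:
    print("node-2:", " ".join(cmd))
for node, cmds in clients.items():
    for cmd in cmds:
        print(f"{node}:", " ".join(cmd))
```

With two interfaces and two client nodes this yields four servers on node-2 and two clients on each of node-1 and node-3.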

The included kernel logs are from the test using only one interface.
Look for the `[i40e]` pattern from the end of the file and you'll find the 
relevant stacktraces (there are many).
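
A minimal Python sketch of that search, assuming the log has been read into a list of lines (the function name and the `limit` default are made up for illustration):

```python
def last_driver_frames(log_lines, pattern="[i40e]", limit=5):
    """Return up to `limit` lines containing `pattern`, newest first."""
    hits = []
    # Walk the log backwards so the most recent stack traces come first.
    for line in reversed(log_lines):
        if pattern in line:
            hits.append(line.rstrip("\n"))
            if len(hits) == limit:
                break
    return hits


sample = [
    "kernel:  __alloc_pages_nodemask+0x15d/0x300\n",
    "kernel:  i40e_alloc_rx_buffers+0x14a/0x260 [i40e]\n",
    "kernel:  i40e_napi_poll+0xda3/0x1720 [i40e]\n",
    "kernel:  napi_poll+0x96/0x1b0\n",
]
for frame in last_driver_frames(sample):
    print(frame)
```

The match is a plain substring test, so the square brackets need no escaping.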

The bug happens when using the out-of-tree module at version 2.13.10.
It doesn't happen with the in-tree module at kernel versions 5.4.0,
5.8.2 and 5.10.2. Neither does it happen when using the out-of-tree module at 
2.12.6.

OS: Ubuntu 20.10
Kernel: 5.8.0-33-generic
i40e version: 2.13.10 - BB96E598E7BFA4F229F7E53
X722 firmware: 5.15 0x8000275d 1.2829.0
iperf: 3.7
TSO and GRO are deactivated
Server: Dell PowerEdge R6525
CPUs: 2x AMD EPYC 7352 24-Core Processor

Regards,

--
Marc 'risson' Schmitt
CRI - EPITA
