Thank you, Donald, Jesse, and all, for helping us. We'll go through the responses you gave us and let you know if we have any questions or findings.
Assaf

On Thu, Dec 7, 2023 at 9:56 PM Brandeburg, Jesse <jesse.brandeb...@intel.com> wrote:

> Hi Assaf, and thanks Don for mentioning the Cisco link.
>
> I had a further look at the stats and see this:
>
> mac_local_faults.nic: 0
> mac_remote_faults.nic: 1
>
> on both the sender and receiver stats. Remote fault means the switch RX
> PCS failed to maintain locked state (far end of the cable away from our
> adapter). This might help your switch team or Cisco figure out what is
> going on.
>
> In this case I don’t think it’s the driver or the local end firmware, but
> I would strongly suggest that you update the firmware to a newer version on
> (some of) your cards, and you can get the updated firmware from Cisco.
>
> So, I’d be asking, why is the switch cycling or dropping the link? Hope
> this helps!
>
> Jesse
>
> *From:* Buchholz, Donald <donald.buchh...@intel.com>
> *Sent:* Thursday, December 7, 2023 11:05 AM
> *To:* Assaf Albo <ass...@qwilt.com>
> *Cc:* Brandeburg, Jesse <jesse.brandeb...@intel.com>;
> e1000-devel@lists.sourceforge.net; Matan Levy <mat...@qwilt.com>; Itamar
> Maron <itam...@qwilt.com>
> *Subject:* RE: [e1000-devel] Intel E810 100Gb goes down sporadically
>
> Hi Assaf,
>
> Thank you for the data. I see from the data files you included that
> you are working with a Cisco-branded E810-CQDA2 NIC.
>
> As this is a Cisco supported NIC, have you consulted Cisco support
> and configured your system with Cisco-approved firmware/vendor
> versions?
>
> I do not support the Cisco products, but I see immediately that the
> NIC FW is revision 2.25. The ice driver v1.9.11 was developed at
> Intel for use with 4.xx firmware.
>
> Please contact Cisco. If they cannot resolve the matter, they will
> reach out to the appropriate Intel support team for this product.
>
> Best regards,
> - Don
>
> *From:* Assaf Albo <ass...@qwilt.com>
> *Sent:* Wednesday, December 6, 2023 3:34 AM
> *To:* Buchholz, Donald <donald.buchh...@intel.com>
> *Cc:* Brandeburg, Jesse <jesse.brandeb...@intel.com>;
> e1000-devel@lists.sourceforge.net; Matan Levy <mat...@qwilt.com>; Itamar
> Maron <itam...@qwilt.com>
> *Subject:* Re: [e1000-devel] Intel E810 100Gb goes down sporadically
>
> Hey guys,
>
> Firstly, I'd like to thank you all for helping us out.
>
> Attached to this mail are two files with all the statistics (client
> machine + server machine).
>
> *"The passthrough device shouldn't be any problem but I do recommend that
> if you're passing through the device to a VM, you try to match the
> destination PCIe function number to the origination ID to prevent odd
> issues. like if your host device is: 01:00.1 then (I'm not sure you can do
> this) I'd hope the VM device is 00:06.1, and not 00:06.0"*
>
> Exactly what we are doing, we are matching.
> You can see in the attached files that one of the machines is working with
> eth0 00:06.0 and the other eth1 00:06.1
>
> *"Also, do you see any stats or events on the switch side when link is
> lost?"*
>
> We use Cisco Nexus switches, and our network engineer said that he
> sees events of link down from the ports.
>
> On Wed, Dec 6, 2023 at 6:42 AM Buchholz, Donald <donald.buchh...@intel.com>
> wrote:
>
> Hi Assaf,
>
> In addition to the commands listed by Jesse,
> please also provide "ethtool -i <eth#>" output.
> This will assist us in identifying the NIC and
> Firmware revision you are using.
>
> - Don
>
> > -----Original Message-----
> > From: Jesse Brandeburg <jesse.brandeb...@intel.com>
> > Sent: Tuesday, December 5, 2023 10:47 AM
> > To: Assaf Albo <ass...@qwilt.com>; e1000-devel@lists.sourceforge.net; Matan
> > Levy <mat...@qwilt.com>
> > Subject: Re: [e1000-devel] Intel E810 100Gb goes down sporadically
> >
> > On 12/3/2023 1:26 AM, Assaf Albo via E1000-devel wrote:
> > > Hello guys,
> > >
> > > We are having constant network issues in production in that the link goes
> > > down, waits *exactly* 7-8 seconds, and goes up again.
> > > This can happen zero to a few times a day on all our servers; they are not
> > > in the same location and are connected to different network devices.
> > >
> > > Each server runs as a KVM virtual machine with 60 CPUs (Pinning) and 224Gi
> > > (Huge pages) - overall performance is excellent.
> > > The NIC is PCI passed through to the KVM machine AS IS.
> > > OS Rocky Linux 8.5, kernel 4.18.0-348.23.1.el8_5.x86_64 with Intel ice
> > > 1.9.11 built and installed using rpm.
> > > We have a traffic generator between two servers (our app: client+server)
> > > that is reaching 94Gb and can replicate this issue.
> > >
> > > The dmesg once the issue occurs:
> > > Nov 28 16:01:27 SERVER kernel: ice 0000:00:06.0 eth0: NIC Link is Down
> > > Nov 28 16:01:35 SERVER kernel: ice 0000:00:06.0 eth0: NIC Link is up 100
> > > Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg
> > > Advertised: Off, Autoneg Negotiated: False, Flow Control: None
> >
> > Hi Assaf, sorry to hear you're having problems.
> >
> > w.r.t. the link down events we need to determine if it is a local down
> > or remote.
> >
> > Please gather the 'ethtool -S eth0' statistics for a system that has had
> > some problems, and send to the list as text.
> >
> > also, 'ethtool -m eth0'
> >
> > The passthrough device shouldn't be any problem but I do recommend that
> > if you're passing through the device to a VM, you try to match the
> > destination PCIe function number to the origination ID to prevent odd
> > issues.
> >
> > like if your host device is:
> > 01:00.1 then (I'm not sure you can do this) I'd hope the VM device is
> > 00:06.1, and not 00:06.0
> >
> > So I guess with that statement I'd ask do you ever see the problem on
> > systems with
> > 3b:00.0 (ice PF PCIe in host)
> > 00:06.0 (ice PF in VM)
> >
> > having the link down issues?
> >
> > Please include output from devlink dev info, and if you know it, what
> > switch you're connected to.
> >
> > Also, do you see any stats or events on the switch side when link is lost?
> >
> > - Jesse
> >
> > _______________________________________________
> > E1000-devel mailing list
> > E1000-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/e1000-devel
> > To learn more about Intel Ethernet, visit
> > https://community.intel.com/t5/Ethernet-Products/bd-p/ethernet-products
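
For anyone repeating the local-versus-remote check from this thread on several servers, here is a minimal Python sketch. It is not from the thread and is only an illustration. It reads saved 'ethtool -S' text, looks at the two counters Jesse quoted (mac_local_faults.nic and mac_remote_faults.nic), and measures how long each link outage lasted from syslog-style lines shaped like the two dmesg lines in the original report. The function names, the "remote"/"local"/"both"/"none" labels, the log-line pattern, and the default year are all assumptions made for this example.

```python
import re
from datetime import datetime

# Counter names as quoted in the thread; other drivers may name them differently.
LOCAL_FAULTS = "mac_local_faults.nic"
REMOTE_FAULTS = "mac_remote_faults.nic"

# ASSUMPTION: log lines look like the two dmesg lines in the original report,
# e.g. "Nov 28 16:01:27 SERVER kernel: ice 0000:00:06.0 eth0: NIC Link is Down".
LINK_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+ice\s+\S+\s+(\S+):"
    r"\s+NIC Link is (Down|up)",
    re.IGNORECASE,
)


def parse_ethtool_stats(text):
    """Turn 'name: value' lines from saved `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            continue  # header lines such as "NIC statistics:" have no number
    return stats


def classify_faults(stats):
    """Hypothetical helper: say which side the fault counters point at."""
    local = stats.get(LOCAL_FAULTS, 0)
    remote = stats.get(REMOTE_FAULTS, 0)
    if local and remote:
        return "both"
    if remote:
        return "remote"  # far end (switch side) per the explanation in the thread
    if local:
        return "local"
    return "none"


def link_outages(log_text, year=2023):
    """Pair each 'Link is Down' with the next 'Link is up' per interface.

    Returns a list of (interface, seconds_down). The year is an assumption,
    because syslog timestamps of this shape do not carry one.
    """
    down_at = {}
    outages = []
    for line in log_text.splitlines():
        match = LINK_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        iface, state = match.group(2), match.group(3).lower()
        if state == "down":
            down_at[iface] = stamp
        elif iface in down_at:
            outages.append((iface, (stamp - down_at.pop(iface)).total_seconds()))
    return outages


if __name__ == "__main__":
    sample_stats = "NIC statistics:\n     mac_local_faults.nic: 0\n     mac_remote_faults.nic: 1\n"
    print(classify_faults(parse_ethtool_stats(sample_stats)))  # remote

    sample_log = (
        "Nov 28 16:01:27 SERVER kernel: ice 0000:00:06.0 eth0: NIC Link is Down\n"
        "Nov 28 16:01:35 SERVER kernel: ice 0000:00:06.0 eth0: NIC Link is up 100 Gbps Full Duplex\n"
    )
    print(link_outages(sample_log))  # [('eth0', 8.0)]
```

Run against the values quoted in this thread, the sketch reports a remote fault and an 8-second outage on eth0, which matches what was described. A non-zero counter only shows that a fault was seen at some point since the counters were last reset, so it is worth comparing it with the switch-side logs for the same timestamps.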