Two people here having sudden stop issues with onboard Intel i2XX NIC:s (em driver, i210 & i217lm). Any ideas? Re: em(4) watchdog timeotu on current/amd64

Tinker Tue, 21 Mar 2017 20:44:42 -0700

This is some form of followup on the "em(4) watchdog timeotu on current/amd64" thread, https://marc.info/?t=147559714400006&r=1&w=2 , in that that thread also discusses a stop of traffic.

Hi,

tikun/dchlupacek on CC experienced a similar stop of traffic, with no errors in the dmesg, when upgrading from 5.8 to 5.9 on his C224-chipset onboard Intel i217LM (= same family as the i210) on his production load balancer. He says it also took down a client's network, because CARP had taken over as master since it obviously couldn't communicate. The Intel i210AT on his motherboard kept operating, however.


https://www.supermicro.com/products/motherboard/Xeon/C220/X10SLM-F.cfm

So, on OpenBSD 6.0 MP on my Intel C226 chipset motherboard with two on-board Intel i210:s:


ppb2 at pci0 dev 28 function 2 "Intel 8 Series PCIE" rev 0xd5: msi
pci3 at ppb2 bus 3
em0 at pci3 dev 0 function 0 "Intel I210" rev 0x03: msi, address (MYMAC)
ppb3 at pci0 dev 28 function 3 "Intel 8 Series PCIE" rev 0xd5: msi
pci4 at ppb3 bus 4
em1 at pci4 dev 0 function 0 "Intel I210" rev 0x03: msi, address (MYMAC)

..I got the same problem two days in a row now, on the same machine where I got the watchdog timeouts before:


Suddenly, both NIC:s would simply stop operating, as in:

* One interface had dhclient on it, and it would simply find that the remote DHCP server stopped responding, so it went into permanent sleeping (with occasional retries that fail).

* The other interface has a fixed IP, and trying to SSH to it failed.

So it's interesting to note that both i210:s, which are supposedly separate NIC chips, stopped at the same time.



No error messages are output to the dmesg.

The previous watchdog timeout error had only one fix, and that was to power off the system and start it anew. That worked as a fix these last two days also.

To try to track the error some more, maybe next time I can see if just rebooting the system fixes the problem, as doing that did not fix the watchdog-error-reports-in-dmesg-and-no-data-going-through issue when I had it.

I do some intense TCP/UDP work, which could theoretically be imagined to lock up the networking stack, but since tikun had the same problem, I think that would be very improbable.

The only sysctl:s I have set are: net.inet.ip.forwarding=1, kern.maxfiles=1000000, kern.bufcachepercent=90.

Could I try some particular BIOS setting? (There's not many, nothing obvious there really.)

Could I force the hardware into some legacy or debug mode where it should make more noise?

Also, if it happens again, can I narrow down the error further by inspecting the NIC's state?


Also perhaps just to double check,

* Could I do some sysctl to foolproof the networking stack, like upping mbuf space? Hmm, I guess that should not be relevant, and I'm not even aware that it can be done.

* Can I do some tests on the networking stack, PF and file descriptors to check they're not exhausted or stuck?
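
To make the "exhausted or stuck" question concrete, here is a rough sketch of how I might capture a snapshot next time the stall happens. It assumes Python 3 is installed (it is not in base), and the command list is only my guess at what is worth looking at: ifconfig for link state, netstat -m for mbuf usage, netstat -i for per-interface counters, vmstat -i to see whether the em interrupts still increment, pfctl -si for PF state counters, and sysctl kern.nfiles against kern.maxfiles for file descriptors. The function names are made up for illustration.

```python
import subprocess
from datetime import datetime, timezone

# Commands to capture. All are OpenBSD base utilities; em0/em1 are the
# interface names from the dmesg above. The selection itself is an assumption.
DEFAULT_COMMANDS = [
    ["ifconfig", "em0"],
    ["ifconfig", "em1"],
    ["netstat", "-m"],   # mbuf usage
    ["netstat", "-i"],   # per-interface packet and error counters
    ["vmstat", "-i"],    # interrupt counters
    ["pfctl", "-si"],    # PF status and state-table counters (needs root)
    ["sysctl", "kern.nfiles", "kern.maxfiles"],
]


def run_command(argv, timeout=10):
    """Run one command and return stdout followed by stderr, or a note on failure."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except OSError as exc:
        # Covers a missing binary or a permission problem.
        return f"(could not run: {exc})\n"
    except subprocess.TimeoutExpired:
        return "(timed out)\n"
    return proc.stdout + proc.stderr


def snapshot(commands=DEFAULT_COMMANDS):
    """Return one text report with a UTC timestamp and each command's output."""
    stamp = datetime.now(timezone.utc).isoformat()
    parts = [f"=== snapshot at {stamp} ===\n"]
    for argv in commands:
        parts.append(f"--- {' '.join(argv)} ---\n")
        parts.append(run_command(argv))
    return "".join(parts)
```

Calling print(snapshot()) once while the system is healthy and once during a stall, then diffing the two reports, should at least show whether mbufs, PF states or open files are anywhere near their limits.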


Any pointers much appreciated.

Thanks!
Tinker
