This is a follow-up of sorts to the "em(4) watchdog timeout on
current/amd64" thread, https://marc.info/?t=147559714400006&r=1&w=2 ,
since that thread also discusses traffic stopping.
Hi,
tikun/dchlupacek (on CC) experienced a similar stop of traffic, with no
errors in the dmesg, after upgrading from 5.8 to 5.9, on the onboard
Intel i217LM (= same family as the i210) of his C224-chipset production
load balancer. He says it took down a client's network, because CARP
had taken over as master since it obviously couldn't communicate. The
Intel i210AT on his motherboard kept operating, however.
https://www.supermicro.com/products/motherboard/Xeon/C220/X10SLM-F.cfm
So, on OpenBSD 6.0 MP on my Intel C226 chipset motherboard with two
on-board Intel i210:s:
ppb2 at pci0 dev 28 function 2 "Intel 8 Series PCIE" rev 0xd5: msi
pci3 at ppb2 bus 3
em0 at pci3 dev 0 function 0 "Intel I210" rev 0x03: msi, address (MYMAC)
ppb3 at pci0 dev 28 function 3 "Intel 8 Series PCIE" rev 0xd5: msi
pci4 at ppb3 bus 4
em1 at pci4 dev 0 function 0 "Intel I210" rev 0x03: msi, address (MYMAC)
... I have now had the same problem two days in a row, on the same
machine where I got the watchdog timeouts before:
Suddenly, both NIC:s would simply stop operating, as in:
* One interface had dhclient on it, and it would simply find that the
remote DHCP server stopped responding, so it went into permanent
sleeping (with occasional retries that fail).
* The other interface has a fixed IP, and trying to SSH to it failed.
So it's interesting to note that both i210:s, which are supposedly
separate NIC chips, stopped at the same time.
No error messages are output to the dmesg.
The previous watchdog timeout error had only one fix, which was to
power off the system and start it anew. That worked as a fix these last
two days also.
To try to track the error some more, maybe next time I can see if just
rebooting the system fixes the problem, as doing that did not fix the
watchdog-error-reports-in-dmesg-and-no-data-going-through issue when I
had it.
Since tikun had the same problem, I think it is very improbable that my
workload is the cause, even though I do some intense TCP/UDP work that
could theoretically be imagined to choke the networking stack.
The only sysctl:s I have set are: net.inet.ip.forwarding=1 ,
kern.maxfiles=1000000 , kern.bufcachepercent=90 .
Could I try some particular BIOS setting? (There aren't many, and
nothing obvious there really.)
Could I force the hardware into some legacy or debug mode where it
should make more noise?
Also, if it happens again, can I narrow down the error further by
inspecting the NIC's state?
Also perhaps just to double check,
* Could I set some sysctl to foolproof the networking stack, like
raising the mbuf space? Hmm, I guess that should not be relevant, and
I'm not even aware that it can be done.
* Can I do some tests on the networking stack, PF and file descriptors
to check they're not exhausted or stuck?
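For the next occurrence, something like the following could capture
state to a local file before power-cycling. It is only a rough sketch:
the command list, the interface names em0/em1 (taken from the dmesg
above), the output path, and the function names are my own assumptions
for illustration. ifconfig, netstat -m, netstat -in, vmstat -i,
pfctl -s info and sysctl kern.nfiles/kern.maxfiles are base-system
tools, but whether their output reveals anything about this particular
stall is untested.

```python
#!/usr/bin/env python3
# Hypothetical helper: snapshot NIC, mbuf, interrupt, PF and file
# descriptor state into one text file while the stall is ongoing.
import subprocess
import time

# Assumed set of diagnostics; adjust interfaces and commands as needed.
COMMANDS = [
    ["ifconfig", "em0"],
    ["ifconfig", "em1"],
    ["netstat", "-m"],         # mbuf usage
    ["netstat", "-in"],        # per-interface packet and error counters
    ["vmstat", "-i"],          # interrupt counts per device
    ["pfctl", "-s", "info"],   # PF state table and counters
    ["sysctl", "kern.nfiles", "kern.maxfiles"],
]


def parse_sysctl(text):
    """Turn 'name=value' lines, as printed by sysctl, into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def fd_usage_percent(text):
    """Percentage of kern.maxfiles currently in use, per sysctl output."""
    values = parse_sysctl(text)
    return 100.0 * int(values["kern.nfiles"]) / int(values["kern.maxfiles"])


def run(cmd):
    """Run one command; return its output, or a note if it cannot run."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return proc.stdout + proc.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "failed: %s\n" % exc


def snapshot(path):
    """Write the output of every command in COMMANDS to path."""
    with open(path, "w") as out:
        for cmd in COMMANDS:
            out.write("### %s\n%s\n" % (" ".join(cmd), run(cmd)))


if __name__ == "__main__":
    snapshot(time.strftime("/tmp/em-stall-%Y%m%d-%H%M%S.txt"))
```

Running it once while everything works and once during a stall would
give two files to diff, which might show whether interrupt or packet
counters on em0/em1 stop moving.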
Any pointers much appreciated.
Thanks!
Tinker