Re: CURRENT: massive em0 NIC problems since IFLIB changes/introduction

Alexander Leidinger Fri, 17 Mar 2017 06:16:02 -0700

Quoting "O. Hartmann" <[email protected]> (from Fri, 17 Mar 2017 12:20:18 +0100):

Since the introduction of the IFLIB changes, I have been seeing severe problems on
CURRENT.


I already reported something like this to sbruno@ and M. Macy (in copy).

Running the most recent CURRENT (FreeBSD 12.0-CURRENT #27 r315442: Fri Mar 17
10:46:04 CET 2017 amd64), the problems on a workstation got severe within the
past two days:

For a couple of weeks now, the em0 NIC (Intel i217-LM, see below) has been dying under heavy
I/O. I first noticed this when "rsync"ing poudriere repositories to a remote
NFSv4 (automounted) folder. The em0 device could be revived by an ifconfig down/up
procedure.
But not only the i217-LM chip is affected. On another box equipped with an i350 dual-port GBit NIC I observed similar behaviour under (artificially) high I/O load
(but I didn't investigate that further since it occurred very seldom).


It's not only those chipsets.

It may be beneficial if you could provide the pciconf output for those devices. Mine is:

---snip---
em0@pci0:2:6:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82541PI Gigabit Ethernet Controller'
---snip---
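
For anyone who wants to post the same information, the following is a minimal sketch of how the entry for a single device could be cut out of the "pciconf -lv" listing. The helper name pci_entry is made up for illustration and is not part of FreeBSD; it only assumes that each entry starts with an unindented "<device>@..." line followed by indented detail lines, as in the output above.

```sh
#!/bin/sh
# pci_entry is a hypothetical helper (not a FreeBSD tool).
# It reads `pciconf -lv` output on stdin and prints only the entry
# whose first line starts with "<device>@", e.g. "em0@".
pci_entry() {
    awk -v dev="$1" '
        # An unindented line starts a new entry; decide whether it is ours.
        /^[^ \t]/ { show = (index($0, dev "@") == 1) }
        # Print the header line and the indented lines that follow it.
        show      { print }
    '
}

# Typical use on the affected machine:
#   pciconf -lv | pci_entry em0
```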

Now, since around yesterday, the i217-LM dies without being revivable with
ifconfig down/up: Doing so, my FreeBSD CURRENT machine (Fujitsu Celsius M740)

I don't know whether a simple down/up would help for the chip I see this issue with (it's a headless server in a remote datacenter). For the moment I'm using the workaround of something like "ping -c 1 <gateway> || shutdown -r now" in crontab.
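
Written out as a small script, the crontab workaround might look roughly like the sketch below. This is an illustration only, not the exact entry in use: the function name net_watchdog and the variables PING_CMD and REBOOT_CMD are assumptions introduced here so that the two commands can be swapped out for a dry run.

```sh
#!/bin/sh
# Sketch of a "reboot when the gateway is unreachable" watchdog.
# net_watchdog, PING_CMD and REBOOT_CMD are illustrative names,
# not taken from the original crontab entry.
: "${PING_CMD:=ping -c 1}"          # send a single echo request
: "${REBOOT_CMD:=shutdown -r now}"  # what to do when it goes unanswered

net_watchdog() {
    gateway="$1"
    # The variables are left unquoted on purpose so that
    # "ping -c 1" is split into a command and its arguments.
    if $PING_CMD "$gateway" >/dev/null 2>&1; then
        return 0    # gateway answered, nothing to do
    fi
    # No reply: assume the NIC is wedged and reboot.
    $REBOOT_CMD
}

# To use it from cron, the script would end with a call such as
#   net_watchdog 192.0.2.1
# where 192.0.2.1 stands in for the real gateway address.
```

Setting REBOOT_CMD to something harmless such as "echo reboot" allows the logic to be checked without actually restarting the box.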


The system in question is at r314137.

remains with a dead em0 device, reporting "no route" on some occasions but
stuck in the dead state. Every attempt to re-establish the route manually
fails; only rebooting the box gives some relief.

On the console, I have some very strange reports:

- ping suddenly reports no buffer space

- or I sometimes see massive occurrences of "em0: TX(0) desc avail = 1024, pidx = 0"
  on the console

I don't see this in messages or console log, but I see that ntpd can't resolve hostnames in the logs.

Either way, sending/receiving large files over an established GBit network link,
which can be saturated at approx. 100 MBytes/s, tends to make the NIC fail.

I can report that an "svnlite update" of the FreeBSD src tree on the box is able to trigger the issue in my case.

I have to add that before the iflib changes I've seen frequent em-watchdog timeouts in the logs / dmesg. So for me we have two issues here:

- the hardware wasn't 100% supported before the iflib changes (it seems)

- the iflib changes have lost some watchdog functionality / auto-failure-recovery feature


Bye,
Alexander.

--
http://www.Leidinger.net [email protected]: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    [email protected]  : PGP 0x8F31830F9F2772BF

