Wait, there was indeed a specific hint when the axen misbehavior
started, namely the
axen0: watchdog timeout
axen0: usb error on tx: IN_PROGRESS
axen0: usb error on tx: TIMEOUT
output in the dmesg.
After that, the axen did not receive or send any frames.
axen_watchdog()'s logic to handle this is
	sc = ifp->if_softc;
	ifp->if_oerrors++;
	printf("axen%d: watchdog timeout\n", sc->axen_unit);
	s = splusb();
	c = &sc->axen_cdata.axen_tx_chain[0];
	usbd_get_xfer_status(c->axen_xfer, NULL, NULL, NULL, &stat);
	axen_txeof(c->axen_xfer, c, stat);
	if (!IFQ_IS_EMPTY(&ifp->if_snd))
		axen_start(ifp);
	splx(s);
That obviously did not get the NIC working again, and neither did
"ifconfig axen0 down" followed by "ifconfig axen0 up". (And then the
system froze.)
Is there any way that I could cause a "harder" reset, in that code, or
via command line utilities?
Tinker
On 2017-04-17 01:57, Tinker wrote:
Hi Stefan/bugs@,
(This one is also intended as a followup to the
https://marc.info/?t=149137421500003&r=1&w=2 thread "Disable axen(4)'s
"checksum err (pkt#N)" console/dmes" with Stefan.)
The one permanent system freeze I experienced may have happened for
any complex reason, but it correlated most with the axen ceasing to
function and my re-plugging it. That freaked me out a bit and made me
hesitate to suggest removing its debug output.
The only clue I have regarding a "complex reason" is that a USB memory
stick times out by itself on this machine, which may hint that the USB
implementation on this hardware is imperfect.
I have not seen the axen stop responding (nor any issue on replugging)
since, however.
With my limited code reading skills, I can't see anything that looks
prone to cause a system freeze in axen's sources
(https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/usb/if_axen.c?rev=1.24&content-type=text/x-cvsweb-markup ,
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/usb/if_axenreg.h?rev=1.6&content-type=text/x-cvsweb-markup ):
Firstly, there are no loops in there that could go on forever (which
would be relevant if the axen code runs under the big lock, which I
guess it does; packet processing is driven by a 16-bit counter, which I
guess is not enough to cause any long freeze, and apart from that
the only loop in the whole code is "while (enm != NULL) { ..
ETHER_NEXT_MULTI(step, enm); }", which I guess should be failsafe).
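To illustrate the bound: a loop driven by a 16-bit counter ends after
at most 65536 iterations, even in the worst case where the counter
starts at 0 and wraps around. What follows is only a user-space sketch
of that arithmetic. sketch_count_iterations() is a made-up name and the
loop body is a placeholder, not the driver's actual receive loop.

```c
#include <stdint.h>

/*
 * Count how often a do/while loop driven by a 16-bit counter runs.
 * Hypothetical illustration: same counter width as discussed above,
 * but NOT the loop from if_axen.c.
 */
unsigned long
sketch_count_iterations(uint16_t pkt_count)
{
	unsigned long iterations = 0;

	do {
		iterations++;		/* stand-in for per-packet work */
	} while (--pkt_count != 0);	/* 0 wraps to 65535, then counts down */

	return iterations;
}
```

So even a bogus counter value would only cost a bounded number of
passes, provided each pass itself returns.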
Second, when I detached the axen, I did get "rgephy0 detached" "axen0
detached" printed in the dmesg, which I interpret as meaning that there
should have been no toxic state left in the axen driver at the time of
reattachment.
One could speculate that there was toxic state or some other form of
hiccups in the USB subsystem, but that would really only be guesswork.
My only conclusion from this is that to analyze the problem further,
the problem would need to be reproduced and this time a system trace
would need to be taken from the kernel debugger.
Also, one could try reattaching a bunch of times or otherwise stressing
the NIC to try to provoke the crash again.
And also of course, maybe a driver cannot be designed to keep up with
any exotic failure state within a hardware device, in particular in
combinations of devices (USB controller + axen), so maybe this was
just like the exception that proves the rule.
I didn't see any problems whatsoever with the RTL8153 though.
Thanks,
Tinker
On 2017-04-05 16:40, Tinker wrote:
Hi,
On OpenBSD 6.0 AMD64, I ran this axen device (a "Level1 USB-0401",
http://global.level1.com/Network-Card/USB-0401/p-3209.htm , not listed
as supported in axen(4) ) for several days and it worked well (the only
unexpected behavior was some "checksum err" warnings in the dmesg).
The only unusual thing about this machine is that it may have a USB
storage device which detaches by itself, so if there's any spillover
between USB ports then this could perhaps be caused by that. Though
that one was on USB2 and this axen is on USB3, so that would surprise
me.
When booting, the axen would initialize as usual i.e.,
axen0 at uhub0 port 16 configuration 1 interface 0 "ASIX Elec. Corp. AX88179" rev 3.00/1.00 addr 5
axen0: AX88179, address 00:11:6b:MYMAC
rgephy0 at axen0 phy 3: RTL8169S/8110S/8211 PHY, rev. 5
and after dhclient I'd get the two normal checksum errors,
checksum err (pkt#3)
checksum err (pkt#1)
Now, after a couple of hours, I would suddenly get another 53
repetitions of "checksum err (pkt#1)", and then this:
axen0: watchdog timeout
axen0: usb error on tx: IN_PROGRESS
axen0: usb error on tx: TIMEOUT
"ifconfig axen0" printed:
axen0: flags=8c43<UP,BROADCAST,RUNNING,OACTIVE,SIMPLEX,MULTICAST> mtu 1500
	lladdr 00:11:6b:MYMAC
	index 5 priority 0 llprio 3
	groups: egress
	media: Ethernet autoselect (10baseT half-duplex)
	status: no carrier
	inet MYIP netmask 0xffffff00 broadcast MYIP24.255
"nslookup", "ping" etc. failed (nslookup: "unknown host: google.com" ,
ping: "ping: wrote 8.8.8.8 64 chars, ret=-1").
"dhclient" would go up to the "DHCPDISCOVER" step and no further.
"ifconfig axen0 down" caused no dmesg output. "ifconfig axen0 up"
caused this dmesg output:
axen0: usb errors on rx: IOERROR
I could not find any way using any system tools to force a
reattachment by software.
Detached the axen0 physically. Reattaching it, its status lights lit
up immediately, and the dmesg was populated with:
rgephy0 detached
axen0 detached
axen0 at uhub0 port 16 configuration 1 interface 0 "ASIX Elec. Corp. AX88179" rev 3.00/1.00 addr 5
axen0: AX88179, address 00:11:6b:MYMAC
rgephy0 at axen0 phy 3: RTL8169S/8110S/8211 PHY, rev. 5
And, 1081 repetitions of this:
checksum err (pkt#0)
invalid buffer(pkt#65535), continue
..and then shortly after that, the system became unresponsive so that
I needed to reboot the machine, giving me a suspicion that some part
of the axen driver got stuck in an infinite loop.
On detaching and then reattaching the axen, its LEDs would not light up
anymore, indicating that the system was indeed really frozen. Also,
neither keyboard nor network responded.
The dmesg invocation that printed the 1081 repetitions above did return
to the shell before the system froze, though.
Any thoughts on how to fix this would be much appreciated.
If I see this happen again maybe I will be able to go into the kernel
console and force a kernel stack dump.
Thanks,
Tinker
On 2017-04-05 08:48, Stefan Sperling wrote:
On Wed, Apr 05, 2017 at 06:35:21AM +0000, Tinker wrote:
..
The axen(4) ( http://man.openbsd.org/axen.4 ) driver occasionally prints a
"checksum err (pkt#N)" warning message to the console/dmesg. The rate I'm
observing currently is about three per week, and what I see is that it has
..
Yes, some cleanup is due here.
I suggest you send us a patch that makes it a DPRINTF. You could also add
a line 'ifp->if_ierrors++;' at the same place so the error is counted in
netstat -I axen0.
And there are some similar printfs in the same function that should get
the same treatment, such as this one:
printf("rxeof: too large transfer\n");
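A minimal user-space sketch of that suggestion, assuming a debug macro
in the style commonly used by USB drivers in the tree. The names
SKETCH_DEBUG, sketch_debug, struct sketch_if and
sketch_rx_checksum_err() are hypothetical stand-ins, not the actual
if_axen.c code. Only the pattern itself (message silent unless debugging
is compiled in, error always counted in if_ierrors) comes from the
suggestion above.

```c
#include <stdio.h>

/* Hypothetical debug switch; a real driver would use its own #ifdef. */
#ifdef SKETCH_DEBUG
int sketch_debug = 1;
#define DPRINTF(x)	do { if (sketch_debug) printf x; } while (0)
#else
#define DPRINTF(x)	do { } while (0)
#endif

/* Stand-in for the one struct ifnet field used here. */
struct sketch_if {
	unsigned long if_ierrors;	/* input error counter */
};

/*
 * Would replace an unconditional printf() in the receive path:
 * the message only appears in debug builds, the error is always counted.
 */
void
sketch_rx_checksum_err(struct sketch_if *ifp, int pkt)
{
	(void)pkt;	/* unused when DPRINTF compiles to nothing */
	DPRINTF(("checksum err (pkt#%d)\n", pkt));
	ifp->if_ierrors++;
}
```

Note the double parentheses in the DPRINTF call: the whole printf
argument list is passed as a single macro argument.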