> On 30 Nov 2022, at 14:36, Greg Steuck <gne...@openbsd.org> wrote:
> 
> Greg Steuck <gne...@openbsd.org> writes:
> 
>> The watched kettle never boiled. No more crashes in over two weeks
>> (instead of two in the first week). I tried a loop of alternating iperf3
>> tcp and udp to no ill effect. I still see the growth in the metrics I
>> reported, yet the system remained stable.
>> 
>> I applied the patch below and am still collecting the metrics. I doubt
>> they are responsible for the original problem.
> 
> This time the problem fired after 6 days of uptime. The system is
> running 7.2 + igc off-by-one fix.
> 
> The symptoms are:
> 
> 1 A single interface is "stuck", sometimes ping replies come back,
>  incoming packets are visible in tcpdump, no reply packets
>  appear in tcpdump (nor received on the other machine).
> 2 Other interfaces are fine to the point that I can ssh over one of
>  them to debug.
> 3 The stuck interface remains stuck after ifconfig down/up.
> 4 The stuck interface remains stuck throughout pfctl -d/-e. This did
>  reenable ping replies, that were stuck for a bit.
> 5 netstat -i shows large (and rising) value in Ofail column
> 6 netstat -m shows a number of 'mbuf 2112' stuck fairly high:
>  5539 mbufs in use:
>        5441 mbufs allocated to data
>        7 mbufs allocated to packet headers
>        91 mbufs allocated to socket names and addresses
>  40/112 mbuf 2048 byte clusters in use (current/peak)
>  4510/6630 mbuf 2112 byte clusters in use (current/peak)
> 7 no established connections to speak of
> 
> The interface stickiness is mainly its inability to send higher level
> protocol replies. E.g. a TCP connection from a remote system doesn't get
> completed. Or SYN/ACK completes, but then I can see the data to the
> machine and not even ACKs coming back. ktrace shows the application
> writes the data which just never makes it down the stack to where
> tcpdump would see it.
> 
> Obligatory graph of seemingly related counters: "mbufs in use",
> "mbuf 2112 byte clusters", and "Ofail" counts
> 
> https://docs.google.com/spreadsheets/d/e/2PACX-1vRr61USv9VNvaIq9qEs8W1wy869ai6MwNmevDLmxLJOV3DaUBcrRUzwzNZP92syltrWfrmIUWq7qevG/pubchart?oid=202363413&format=interactive
> 
> There's a long tail to the left covering 5 days of mostly nothing
> happening. The drop of "mbufs in use" from 7355 to 5739 is around the
> time I removed the system from service and possibly when I cycled
> ifconfig down/up (I have no record).
> 
> The system is still up and moved off to the side for debugging. It can
> remain up for as long as we have things to try (and power utility
> cooperates).

Ofails are the sum of output errors and queue drops. Can you figure out which 
one it is with netstat -I igc0 -e and netstat -I igc0 -d?
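
To make the split concrete, here is a minimal Python sketch. It assumes you note the error and drop counters from two runs of the commands above by hand. The function name, the dict keys, and the sample numbers are made up for illustration; none of them come from your machine.

```python
def ofail_growth(before, after):
    """Split Ofail growth between two samples into its two components.

    Each sample is a dict with the hypothetical keys "oerrs" (output
    errors) and "drops" (queue drops); Ofail is taken as their sum.
    """
    d_errs = after["oerrs"] - before["oerrs"]
    d_drops = after["drops"] - before["drops"]
    return {"oerrs": d_errs, "drops": d_drops, "ofail": d_errs + d_drops}


# Illustrative numbers only, not measurements.
growth = ofail_growth({"oerrs": 10, "drops": 200},
                      {"oerrs": 10, "drops": 950})
print(growth)
```

Whichever component carries the growth tells us whether to look at the driver's transmit path or at the interface send queue.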

The state of the rxring accounting in the "systat mb" output would be
interesting too. kstat output is also easy to get, though I'm not sure it will
be useful in this situation.

The mbuf (and all other) pool counters from vmstat -m are easy to grab as well.
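
If you want to keep logging the cluster counters while the box sits idle, a rough Python sketch along these lines could pull the current/peak figures out of netstat -m text. The function name is hypothetical, and the regex assumes the line format matches what you pasted; the sample lines are copied from your report.

```python
import re

# Matches lines such as:
#   4510/6630 mbuf 2112 byte clusters in use (current/peak)
CLUSTER_RE = re.compile(r"(\d+)/(\d+) mbuf (\d+) byte clusters in use")


def cluster_usage(netstat_m_text):
    """Return {cluster_size: (current, peak)} parsed from netstat -m text."""
    usage = {}
    for match in CLUSTER_RE.finditer(netstat_m_text):
        current, peak, size = (int(g) for g in match.groups())
        usage[size] = (current, peak)
    return usage


sample = """\
 40/112 mbuf 2048 byte clusters in use (current/peak)
 4510/6630 mbuf 2112 byte clusters in use (current/peak)
"""
usage = cluster_usage(sample)
print(usage)
```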
