Re: Any acceptance of TCP connections suddenly BREAKS without provocation globally: gives 'connection reset by peer', 'connection refused' or blocks. Only solved by reboot. Details&analysis provided. Request for pointers as to nail down the error source.

Mikael Wed, 14 Aug 2013 08:34:30 -0700

(Karlis: A Q for you at the bottom.)


Claudio,

Your description of what happens if mclusters run out *does* match the
behavior I saw on "hangup".

However, there is an important point to clarify:

On the two occasions I've experienced it, "hangup" lasted for hours, right
up until reboot.

A possible trigger for the "hangup" was that my machine was being spammed
constantly on one TCP port with something like 10 new TCP connections per
second. However, after entering the hangup state, I shut down the TCP server
process serving that port. So if the "hangup" was caused by mclusters
running out, then we're also seeing that the TCP stack doesn't self-heal
from it: shutting down the spammed TCP server process did not trigger
mcluster/mbuf freeing/drainage or any other self-healing that would take
the TCP stack out of "hangup" mode (as in, start to accept incoming TCP
connections from lo0 and other interfaces again).

After ~2 hours of hangup, "netstat -m" changed its output from "13804 Kbytes
allocated to network (97% in use)" to "964 Kbytes allocated to network (27%
in use)". If this means that mclusters & mbufs had indeed been freed, then
while hangup may start because of mclusters running out, it does not
self-heal when they're freed.
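
To see whether the allocation really drains over time, the summary line
could be logged periodically. The following is only a minimal sketch: it
assumes the "netstat -m" summary line has exactly the wording quoted above,
and the function names are made up for illustration.

```python
import re
import subprocess
import time

# Matches the summary line quoted above, e.g.
# "13804 Kbytes allocated to network (97% in use)".
# Assumption: the wording is exactly as quoted in this mail.
SUMMARY_RE = re.compile(r"(\d+)\s+Kbytes allocated to network \((\d+)% in use\)")


def parse_netstat_m(text):
    """Return (kbytes_allocated, percent_in_use), or None if the line is absent."""
    # Collapse whitespace so a line that got wrapped still matches.
    flat = " ".join(text.split())
    match = SUMMARY_RE.search(flat)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def kbytes_in_use(kbytes, percent):
    """Approximate Kbytes actually in use, rounded down."""
    return kbytes * percent // 100


def sample():
    """Run `netstat -m` once and return a timestamped reading (hypothetical helper)."""
    out = subprocess.run(["netstat", "-m"], capture_output=True,
                         text=True, check=True).stdout
    return time.strftime("%Y-%m-%d %H:%M:%S"), parse_netstat_m(out)
```

Calling sample() from a loop or from cron and appending the result to a
file would give a timeline to compare against the moment the hangup starts.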

Even more weirdly, outgoing TCP connections worked perfectly while "hangup"
was still in effect, i.e. while no new incoming TCP connections were being
accepted.
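
A small probe could timestamp when incoming connections stop being
accepted. This is a minimal sketch, not a tested diagnostic: it assumes the
hangup also affects connections over lo0 (as described above), and the
function name and the 2-second timeout are arbitrary choices.

```python
import socket


def loopback_accept_works(timeout=2.0):
    """Open a listener on 127.0.0.1, connect to it, and try to accept.

    Returns True if the connection was accepted within `timeout` seconds,
    False if connect or accept failed or timed out.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        listener.listen(1)
        listener.settimeout(timeout)
        port = listener.getsockname()[1]

        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.settimeout(timeout)
        try:
            client.connect(("127.0.0.1", port))
            conn, _addr = listener.accept()
            conn.close()
            return True
        except OSError:
            # Covers refused, reset, and timed-out attempts alike.
            return False
        finally:
            client.close()
    finally:
        listener.close()
```

Run every few seconds with the result logged, this would show whether the
failure sets in abruptly or gradually.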

(For my complete notes from hangup see my 7 July post
http://marc.info/?l=openbsd-bugs&m=137321082217664&w=2 .)

Now therefore, I wish to pose this question:

If mclusters run out, what is supposed to happen? Are there robust code
paths for "self-healing" asap, or is the TCP stack implemented in such a
way that the "hangup state" does not reverse and the broken state remains
for hours (=Karlis' experience) or until reboot (=my experience)?


Also, if I see this happen again and there is any particular debug
reporting, dumping, or resetting tool I can run that would help us nail
down the root cause, feel free to let me know.


Last, regarding the possibility of me experiencing hangup again: Since
those two experiences, the server software on the flooded port has been
given a cap on how many connections it accepts simultaneously. If hangup
happened because mclusters/mbufs ran out, I guess this new limit may mean I
won't experience any more hangups.
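
For illustration, a cap of this kind can be sketched as below. This is not
the actual server's code: the class and function names are hypothetical,
and closing excess connections immediately is just one possible policy.

```python
import threading


class ConnectionCap:
    """Counts active connections and refuses new ones above `limit`."""

    def __init__(self, limit):
        self._limit = limit
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Reserve a slot; return False if the cap is already reached."""
        with self._lock:
            if self._active >= self._limit:
                return False
            self._active += 1
            return True

    def release(self):
        """Free a slot once a connection has been closed."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


def _run(conn, cap, handler):
    """Serve one connection, then always close it and free its slot."""
    try:
        handler(conn)
    finally:
        conn.close()
        cap.release()


def serve(listener, cap, handler):
    """Accept loop: connections above the cap are closed immediately."""
    while True:
        conn, _addr = listener.accept()
        if not cap.try_acquire():
            conn.close()  # over the cap: drop instead of holding buffers
            continue
        threading.Thread(target=_run, args=(conn, cap, handler),
                         daemon=True).start()
```

The point of the design is that socket buffers are only held for
connections below the limit, so a flood cannot grow the buffer use without
bound.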

In any case, if I see hangup happen again I'll report it here. If the
frequency of occurrence so far holds, I should get the next "hangup"
around 5 September.



Karlis: When you entered "hangup" state, how many new TCP connections were
coming in to your system per second, how many TCP connections did you have
open in total, and about how much data was going through them?

If you experience the "hangup" again, feel free to run the commands listed
in my post from 16 hours ago
(http://marc.info/?l=openbsd-bugs&m=137643442326455&w=2). Those were the
commands the #OpenBSD IRC channel suggested that I run on "hangup"; there
are more notes in the 7 July post. At least this way we could check for
similarities.


Best regards,
Mikael

2013/8/14 Claudio Jeker <[email protected]>


> > What would be OpenBSD's ordinary transient and lasting behavior if
> > mclusters run out? (I guess mclusters are supergroups of mbufs?)
>
> If there are no more clusters the system will break in strange ways.
> Like accepting no new connections or having issues to send more packets.
> If there is no cluster some parts of the network stack will replace them
> with a mbuf chain of about 8 mbufs, other parts will just fail and drop
> data.
>
> The 6144 clusters we allow by default is for a basic system any kind of
> halfway busy server with many connections needs to increase that limit.
>
> > So, my nmbcluster is set to 200MB so just in case there was a peak
> > mcluster use of 200MB, there would be allocation failures?
>
> 5842 * 2048 = 11684k or 11MB the other 100MB are most probably used all up
> by mbufs which is very bad. Didn't the system tell you to increase about
> the limit?
> Something like "WARNING: mclpools limit reached; increase kern.maxclusters"
>
> > Do you have any thought on where the mclusters may have went?
>
> TCP send/recv buffers is one possible place. netstat -an reports the
> amount of memory used up by each socketbuffer.
>
> --
> :wq Claudio
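
The cluster arithmetic in the quoted reply can be checked with a few lines.
This is only a worked example: the 2048-byte cluster size and the counts
5842 and 6144 are taken from the quoted text, and the function name is made
up.

```python
CLUSTER_BYTES = 2048  # cluster size used in the calculation quoted above


def clusters_to_kbytes(n_clusters, cluster_bytes=CLUSTER_BYTES):
    """Memory held by n_clusters clusters, in Kbytes (1 Kbyte = 1024 bytes)."""
    return n_clusters * cluster_bytes // 1024


print(clusters_to_kbytes(5842))  # 11684, the figure from the quoted reply
print(clusters_to_kbytes(6144))  # 12288, i.e. 12MB for the default limit
```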
