Re: socket shutdown delay?

Terry Lambert Wed, 16 Jan 2002 17:24:50 -0800

Chad David wrote:
> The direct cause is a bug in my client.  I call close(2) outside of the
> main loop (one line off :( ), so none of the client side sockets were
> getting closed.  When I fixed this all of the connections went to
> TIME_WAIT right away.
> 
> I'm still not convinced that all is well though, as on Solaris 5.9 and
> 4.4-STABLE I do not see the problem with the bad client.


So it's the resource track close of the sockets.

If the client and the server were the same program, you could
be seeing this as a timing thing on order of operation.  I'm
guessing they aren't, though...


> I'll address your points below, but if you don't feel like chasing this
> anymore that is fine with me... I'll add it to my list of things to
> try and understand on my next vacation :).

Unless there's something that jumps out at you, this is probably
a good plan.  8-).


> > Also make sure that the keepalive sysctl is set on (1).

[ ... it's on, so it's not the RST instead of FIN/FIN-ACK/2MSL
      losing the RST that isn't retransmitted ... ]


> > You should probably call shutdown(2), if you want your code
> > to be mostly correct.
> 
> Call shutdown(2) instead of close(2)?

Nope.  Before close.  Depending on the argument, perhaps not
before the last read or write, then the close.


> > I suspect that you are just doing a large number of connections.
> 
> One connection at a time, as fast as the client can loop, with
> a small (1k) amount of data being returned by the server.

So you would shutdown() after the request, but before reading
the response, to indicate that you have no more request data
to send.
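
A minimal sketch of that sequence, assuming a connected stream
socket.  The function name and the buffer handling are illustrative
and are not taken from the client under discussion:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Send a request, half-close the write side, then read the reply
 * until EOF or until resp is full.  Returns the number of reply
 * bytes stored in resp, or -1 on error.  The caller still owns fd
 * and must close(2) it afterwards.
 */
ssize_t
request_then_half_close(int fd, const void *req, size_t reqlen,
    char *resp, size_t resplen)
{
	const char *p = req;
	size_t got = 0;
	ssize_t n;

	assert(req != NULL && resp != NULL);

	/* Write the whole request, retrying on short writes and EINTR. */
	while (reqlen > 0) {
		n = write(fd, p, reqlen);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		p += n;
		reqlen -= (size_t)n;
	}

	/* No more request data to send; the read side stays open. */
	if (shutdown(fd, SHUT_WR) < 0)
		return (-1);

	/* Read until the peer finishes (read returns 0) or resp is full. */
	while (got < resplen) {
		n = read(fd, resp + got, resplen - got);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		if (n == 0)
			break;
		got += (size_t)n;
	}
	return ((ssize_t)got);
}
```

The close(2) still follows once the response has been read; the
shutdown(2) only tells the server that the request is complete.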


> > My guess is that you have run out of mbufs (your usage stats
> > tell me nothing about the available number of real mbufs;
> > even the "0 requests for memory denied" is not really as
> > useful as it would appear in the stats), or you just have
> > an incredibly large number of files open.
> 
> colnta->sysctl -a | grep mbuf
> kern.ipc.nmbufs: 67584
> kern.ipc.mbuf_wait: 64
> kern.ipc.mbuf_limit: 512

This number is how many mbufs are possible.  It represents the map
size for the page table entries, and doesn't really indicate
that there are physical pages of RAM available to back them.

However, since this is only ~33M of memory, this is nothing.

With the 1K size of the data you are sending from the server,
this puts you at a connection max of 16,000, assuming all
data is sent but not yet ACK'ed... or 8,000, given that both
client and server are on the same machine.  The absolute worst
case sits down around 6,000 (ACK packets, driver mbufs, socket
option mbufs, etc., dragging it down a little).

So it's unrelated to that, but we already knew that because
of the program change.  8-).


> > > The client eventually fails with EADDRNOTAVAIL.
[ ... ]
> With the fixed client it never fails.  I moved a few GB through it
> without any problem.

You will want to up the user ports on the clients when you start
stress testing it from multiple client machines, anyway.

> > This indicates a 2MSL draining.  The resource track close could
> > also be slow.  You could probably get an incredible speedup by
> > doing explicit closes in the client program, starting with the
> > highest used fd, and working down, instead of going the other
> > way (it's probably a good idea to modify the FreeBSD resource
> > track close to do the same thing).
> 
> If I had been doing any explicit closes :(.

Yes, but your ordering is reverse optimal, actually, so you are
going to be rate limited at the client.

Did the client actually exit?  If it didn't, that would explain
everything.
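
A rough sketch of closing from the highest descriptor down, assuming
the client keeps its open sockets in an array.  The helper names are
hypothetical, and whether the ordering actually helps depends on how
the kernel's descriptor table is implemented:

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* qsort(3) comparator: larger descriptors sort first. */
static int
fd_desc_cmp(const void *a, const void *b)
{
	int x = *(const int *)a;
	int y = *(const int *)b;

	return ((x < y) - (x > y));
}

/*
 * Close every descriptor in fds[], highest number first.
 * Sorts fds[] in place; returns how many close(2) calls succeeded.
 */
size_t
close_highest_first(int *fds, size_t n)
{
	size_t i, closed = 0;

	if (n == 0)
		return (0);
	assert(fds != NULL);

	qsort(fds, n, sizeof(fds[0]), fd_desc_cmp);
	for (i = 0; i < n; i++)
		if (close(fds[i]) == 0)
			closed++;
	return (closed);
}
```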

> > There are some other inefficiencies in the fd code that can be
> > addressed... nominally, the allocation is a linear search at
> > the last valid one going higher.  For most servers, this could
> > be significantly improved by linking free fd's in a sparse
> > list onto a "freelist", and maintaining a pointer to that,
> > instead of the index to the first free one, but that should only
> > impact you on allocation (like the inpcb hash, which fails
> > pretty badly, even when you tune up the hash size to some
> > unreasonable amount, and the port allocation for outbound
> > connections, which is, frankly, broken.  Both could benefit from
> > a nice btree overhaul).
> 
> I actually implemented something for this type of problem over Christmas
> with one of the Solaris engineers.  It was inspired by Jeff Bonwick's
> vmem stuff (Usenix 2001), but was bit mask based, so the actual storage
> overhead was a lot less, with what appeared to be very good allocate and
> free times (O(n) as the worst case with O(1) typically).

This would be nice for FreeBSD, assuming we could pry it out
of you.  8-).
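
For illustration only, a bit mask allocator of that general shape
might look like the sketch below.  It is not the implementation
described above; the names, the fixed table size, and the "hint"
index are all assumptions.  Allocation scans words starting at the
hint, so it is typically one word test and at worst a walk over the
whole map:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SLOT_MAX	1024	/* hypothetical; must be a multiple of WORD_BITS */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define NWORDS		(SLOT_MAX / WORD_BITS)

struct slotmap {
	unsigned long	used[NWORDS];	/* bit set = slot in use */
	size_t		hint;		/* every word below this is full */
};

/* Allocate the lowest free slot; returns -1 if the map is full. */
int
slot_alloc(struct slotmap *m)
{
	size_t w, b;

	for (w = m->hint; w < NWORDS; w++) {
		if (m->used[w] == ULONG_MAX)
			continue;		/* word is full, skip it */
		for (b = 0; b < WORD_BITS; b++) {
			if ((m->used[w] & (1UL << b)) == 0) {
				m->used[w] |= 1UL << b;
				m->hint = w;
				return ((int)(w * WORD_BITS + b));
			}
		}
	}
	m->hint = NWORDS;			/* everything is in use */
	return (-1);
}

/* Release a slot previously returned by slot_alloc(). */
void
slot_free(struct slotmap *m, int slot)
{
	size_t w, b;

	assert(slot >= 0 && slot < SLOT_MAX);
	w = (size_t)slot / WORD_BITS;
	b = (size_t)slot % WORD_BITS;
	assert((m->used[w] & (1UL << b)) != 0);

	m->used[w] &= ~(1UL << b);
	if (w < m->hint)
		m->hint = w;			/* lowest free word moved down */
}
```

The storage cost is one bit per slot plus the hint, which is where
the low overhead relative to a linked freelist would come from.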


[ ... timer code, Rice U. Opportunistic Timers ... ]

> I think I have that paper around here somewhere... is it older,
> like from around 1990?

No, you are probably thinking of the WRL paper by Jeff Mogul.
The paper I'm referring to is late mid-90's.


> > > Nope.  Stock -current, none of my patches applied.
> >
> > Heh... "not useful information without a date of cvsup,
> > and then possibly not even then".  Moving target problems...
> 
> The original email has the uname and a dmesg, but:
> FreeBSD colnta 5.0-CURRENT FreeBSD 5.0-CURRENT #17: Sun Jan 13 03:51:32 MST 2002

I would need to check it out, and build my own copy, and see
if I could repeat it (I'd need your broken client and your
server code).  It would be much better to see if other well
known and tested versions of FreeBSD exhibited the same symptoms,
and, if not, track it down with a bsearch of the CVS tree by date.
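
The search by date is an ordinary bisection.  A sketch, assuming the
bug appears on some day and stays from then on, and that days are
numbered from a known-good checkout.  The callback type is
hypothetical; in practice it would check out the tree for that date,
build it, and run the broken client against it:

```c
#include <assert.h>

/* Hypothetical test hook: nonzero if the tree as of `day` shows the bug. */
typedef int (*is_bad_fn)(long day, void *ctx);

/*
 * Return the first day in (good, bad] on which the bug appears.
 * Precondition: the tree at `good` works and the tree at `bad` fails.
 */
long
first_bad_day(long good, long bad, is_bad_fn is_bad, void *ctx)
{
	assert(good < bad);

	while (bad - good > 1) {
		long mid = good + (bad - good) / 2;

		if (is_bad(mid, ctx))
			bad = mid;	/* bug already present at mid */
		else
			good = mid;	/* mid still works */
	}
	return (bad);
}

/* Stand-in for a real checkout-build-test cycle, used only to exercise the search. */
static int
bad_since(long day, void *ctx)
{
	return (day >= *(long *)ctx);
}
```

A year of commits takes about nine build-and-test cycles this way.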


> > Can you repeat this on 4.5RC?  If so, try 4.4-RELEASE.  It
> > may be related to the SYN cache code.
> 
> I do not have a RC or RELEASE box, but 4.4-STABLE does not do this.

OK, that's interesting.

The reason I mentioned the RC stuff (FreeBSD 4.5 from the RELENG_4
head tag checkout) is that a major difference is the MFC of the SYN
cache/cookie code.

> > The SYN-cookie code is vulnerable to the "ACK gun" attack,
> > and since the SYN cache code falls back into SYN cookie
> > (it assumes that the reason it didn't find the corresponding
> > SYN in the SYN cache is that it overflowed and was discarded,
> > turning naked ACK attempts into SYN-cookie attempts completely
> > automatically), you might be hitting it that way.
> >
> > If that's the case, then I suggest leaving the SYN cache
> > enabled, and disabling the SYN cookie.  If that doesn't fix
> > it, then you may also want to try disabling the SYN cache.
> 
> I'll have to look into this stuff to understand what you are saying.

Do that; it's probably where the problem is, given that 4.4
doesn't have it.


> > Other than that, once you've tried this, then I will need to
> > know what the failure modes are, and then more about the
> > client and server code (kqueue based?  Standard sockets
> > based?), and then I can suggest more to narrow it down.
> 
> Very simple sockets.  Basically:
>         ... accept() -> read() -> write() -> close() ...

OK; there's a potential latency issue in the accept filter
code, but if you aren't using kqueue, then that's not it.

> The actual read(), write(), close(), takes place in a separate thread,
> but there is only one thread active at a time.

Yep; ignore this angle on the problem.


> > Another thing you may want to try is delay closing the
> > server side of the connection for 1-2 seconds after the
> > last write.  This is the canonical way of forcing a client
> > to do the close first in all cases, which totally avoids
> > the server-side-close-first case, which also avoids the
> > FIN_WAIT_2.  For real code, you would have to add a "close
> > cache" and timer.
> 
> Given that each connection is in its own thread this is very doable...

This would at least isolate it to the client vs. server code
and order of operation.  If it's the server close, then the
issue is perhaps a re-ACK after FIN ACK from the close making
the SYN cache think it has a new connection, then a re-FIN,
with the close, to get it into the strange state...
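
For the "close cache" and timer variant, a minimal sketch, assuming
a single-threaded server that calls the sweep routine from its main
loop.  The structure, names, and capacity are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

#define CLOSE_CACHE_MAX	256	/* hypothetical capacity */

struct close_cache {
	struct {
		int	fd;
		time_t	due;	/* earliest time the fd may be closed */
	} ent[CLOSE_CACHE_MAX];
	size_t	n;
};

/*
 * Queue fd to be closed `delay` seconds after `now`.
 * Returns 0, or -1 if the cache is full (the caller should then
 * close the descriptor immediately).
 */
int
close_later(struct close_cache *c, int fd, time_t now, time_t delay)
{
	assert(fd >= 0);

	if (c->n == CLOSE_CACHE_MAX)
		return (-1);
	c->ent[c->n].fd = fd;
	c->ent[c->n].due = now + delay;
	c->n++;
	return (0);
}

/* Close every queued fd whose deadline has passed; returns how many. */
size_t
close_cache_sweep(struct close_cache *c, time_t now)
{
	size_t i = 0, closed = 0;

	while (i < c->n) {
		if (c->ent[i].due <= now) {
			(void)close(c->ent[i].fd);
			c->n--;
			c->ent[i] = c->ent[c->n];	/* swap-remove */
			closed++;
		} else
			i++;
	}
	return (closed);
}
```

The server would call close_later(&cache, fd, time(NULL), 2) after
its last write instead of close(fd), giving the client a chance to
close first.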


> > Hope this helps...
> 
> If nothing else I'm learning... I just wish I could read as fast
> as you can type :).

Heh.  My max rate is 135 WPM, which is 6*135 CPM or 13.5 CPS, or
135 BAUD, which is slow as molasses compared to most people's
read rates (by about 2 orders of magnitude).

-- Terry

