Re: [networking-discuss] connect(2) blocks for more than 250 ms over loopback

Brian Utterback Tue, 22 Sep 2009 13:22:45 -0700

I don't have any generic, simple solutions for this. As you noted, it can be much more difficult in Solaris to trace things that are event driven or timer driven than it is to trace things starting at the syscall level.

The next step I took on my problem was to capture the queue pointer used in the call, and then trace all function calls that take queue pointers and were passed the interesting queue pointer. Unfortunately, that can only get you so far, because the queue pointer for the other end is different. But you can find that out using mdb and do the same thing at that end.


Good luck.

Iwan Aucamp wrote:

Hi Brian
My dilemma is this: I can't trace inside accept(2) the same way I did here, because Solaris completes TCP handshaking regardless of accept(2) (given there is sufficient queue space, which there is supposed to be here).
I tried tracing all FBT probes (no predicates) on a test system while doing a connect so that I have a reference to look at, but when doing this DTrace drops a lot and the output is largely useless. Doing this on the live system is really out of the question.
Any ideas on how to minimize the coverage of the DTrace script to run on the listener?
Anyway, thanks for the hint; I will check the code. This sounds quite plausible.
Regards

-------- Original Message  --------
Subject: Re: [networking-discuss] connect(2) blocks for more than 250 ms over loopback
From: Brian Utterback <[email protected]>
To: Iwan Aucamp <[email protected]>
Cc: [email protected]
Date: Tue Sep 22 2009 19:23:35 GMT+0200 (SAST)
I am afraid that this dtrace output isn't going to be useful. The problem is happening at the other end of the connection, in the accept processing. When you do a connect, you send a message down the stream saying that you want to connect. Often, the thread responsible for processing the connect will continue the processing down the stream and do the accept side processing as well, but sometimes the stream will be locked because another thread is processing in it. In that case, the connect message is queued and the connect thread goes to sleep, waiting for a response. Unfortunately, in your case the response doesn't seem to be forthcoming, and the signal wakes it up and aborts the connect call.
I saw something similar to what you are experiencing once. It turned out that the program incorrectly re-used sockets after failed connection attempts. Since the sockets had some old state when connect was called, they failed to transition through the states correctly and ended up waiting for the connect to complete, even though the connect message was never sent. You might want to check that.
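One way to rule out the socket re-use problem described above is to never call connect a second time on the same descriptor. The following is only a minimal sketch of that idea, not the code of either program discussed in this thread. The helper names loopback_addr and connect_with_fresh_sockets are hypothetical, and IPv4 over loopback is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: IPv4 loopback address for a port in host byte order. */
static struct sockaddr_in loopback_addr(unsigned short port)
{
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = htons(port);
    return sa;
}

/*
 * Hypothetical helper: try to connect up to `attempts` times, creating a
 * brand-new socket for every attempt. A socket whose connect() failed is
 * closed and never passed to connect() again, so no stale state survives.
 * Returns a connected descriptor, or -1 with errno from the last failure.
 */
static int connect_with_fresh_sockets(const struct sockaddr_in *sa, int attempts)
{
    assert(attempts > 0);
    for (int i = 0; i < attempts; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        if (connect(fd, (const struct sockaddr *)sa, sizeof *sa) == 0)
            return fd;
        int saved = errno;   /* close() may overwrite errno */
        close(fd);           /* discard the failed socket entirely */
        errno = saved;
    }
    return -1;
}
```

A caller would build the address once, for example with loopback_addr(port), and treat a return value of -1 as final failure. Whether to retry at all, and how often, is a policy choice this sketch leaves open.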
Iwan Aucamp wrote:
Hi
On a T2000 node running Solaris 10 u5, connect(2) over loopback to an application which is completely idle blocks for longer than 250 ms.
The system load average when this occurs is also below 2, and another system (same OS, hardware and software) with much higher CPU load does have this problem.
In an attempt to figure out what is causing this I ran dtrace with fbt:: and syscall:connect:, extracting all fbt:: when a connect fails (the application calling connect times out after 250 ms using alarm, so connect fails with EINTR).
The results are attached:

connect.ok.verbose        : successful connect, verbose, for reference
connect.ok                : successful connect, for reference (can be used with diff against connect.timeout)
connect.timeout.verbose   : timed-out connect, verbose
connect.timeout           : timed-out connect (can be used with diff against connect.ok)
The major difference starts at:

connect.timeout:
246:                          -> (fbt:ip:1:ip_wput_local:entry)
247:                            -> (fbt:ip:1:ip_fanout_tcp:entry)
250: ->(fbt:ip:1:squeue_enter_nodrain:entry)
connect.ok:
240:                          -> (fbt:ip:1:ip_wput_local:entry)
241:                            -> (fbt:ip:1:ip_fanout_tcp:entry)
244: ->(fbt:ip:1:squeue_enter_nodrain:entry)
From here, connect.timeout goes to fbt:genunix:timeout and fbt:genunix:timeout_common.
As far as I can figure, a TCP SYN packet is sent over IP but a TCP SYN-ACK never comes back.
Does this seem like the correct interpretation, and does anybody have any ideas regarding this?
Regards


------------------------------------------------------------------------

_______________________________________________
networking-discuss mailing list
[email protected]


--
blu

It's bad civic hygiene to build technologies that could someday be
used to facilitate a police state. - Bruce Schneier
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom