I am afraid that this dtrace output isn't going to be useful. The
problem is happening at the other end of the connection, in the accept
processing. When you do a connect, you send a message down the
stream saying that you want to connect. Often, the thread responsible
for processing the connect will continue the processing down the
stream and do the accept side processing as well, but sometimes the
stream will be locked because another thread is processing in it. In
that case, the connect message is queued and the connect thread goes
to sleep, waiting for a response. Unfortunately, in your case the
response doesn't seem to be forthcoming, and the signal wakes it up
and aborts the connect call.

I saw something similar to what you are experiencing once. It turned
out that the program incorrectly re-used sockets after failed
connection attempts. Since the sockets had some old state when connect
was called, they failed to transition through the states correctly and
ended up waiting for the connect to complete, even though the connect
message was never sent. You might want to check that.
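
To illustrate that check, here is a minimal sketch, assuming a plain IPv4
TCP client. It opens a new socket for every attempt and closes it on
failure, instead of calling connect again on a descriptor that may still
carry old state. The helper name connect_fresh and its max_attempts
parameter are hypothetical and not taken from the application in question.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to connect up to max_attempts times, using a
 * brand-new socket for each attempt. Returns a connected descriptor, or
 * -1 with errno set by the last failing call.
 */
int connect_fresh(const struct sockaddr_in *addr, int max_attempts)
{
    assert(addr != NULL);

    if (max_attempts <= 0) {
        errno = EINVAL;
        return -1;
    }

    for (int i = 0; i < max_attempts; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
            return fd;

        /* The attempt failed: discard this socket rather than reuse it. */
        int saved = errno;
        close(fd);
        errno = saved;
    }
    return -1;
}
```

If the application keeps a descriptor around after a failed or interrupted
connect and passes it to connect again, that is the pattern worth looking
for.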
Iwan Aucamp wrote:
Hi
On a T2000 node running Solaris 10 u5, connect(2) over loopback to an
application which is completely idle blocks for longer than 250 ms.
The system load average when this occurs is below 2, and another
system (same OS, hardware and software) with much higher CPU load does
not have this problem.
In an attempt to figure out what is causing this I ran dtrace with fbt::
and syscall:connect:, extracting all fbt:: when a connect fails
(application calling connect times out after 250ms using alarm so
connect fails with EINTR).
The results are attached:
connect.ok.verbose      : successful connect, verbose, for reference
connect.ok              : successful connect, for reference (can be
                          used with diff against connect.timeout)
connect.timeout.verbose : timed-out connect, verbose
connect.timeout         : timed-out connect (can be used with diff
                          against connect.ok)
The major difference starts at:
connect.timeout:
246: -> (fbt:ip:1:ip_wput_local:entry)
247: -> (fbt:ip:1:ip_fanout_tcp:entry)
250: -> (fbt:ip:1:squeue_enter_nodrain:entry)
connect.ok:
240: -> (fbt:ip:1:ip_wput_local:entry)
241: -> (fbt:ip:1:ip_fanout_tcp:entry)
244: -> (fbt:ip:1:squeue_enter_nodrain:entry)
From here connect.timeout goes to fbt:genunix:timeout and
fbt:genunix:timeout_common.
As far as I can figure, a TCP SYN packet is sent over IP but a TCP
SYN-ACK never comes back.
Does this seem like the correct interpretation, and does anybody have
any ideas regarding this?
Regards
------------------------------------------------------------------------
_______________________________________________
networking-discuss mailing list
[email protected]
--
blu
It's bad civic hygiene to build technologies that could someday be
used to facilitate a police state. - Bruce Schneier
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom