Hi
Thanks for the response, Brian. This is also the first thing I noticed -
but my understanding of Solaris is that a system call on a socket in
blocking mode is taken out of the active context, and the reason it came
back into context here is that it received a signal which caused it
to return. The reason I think it actually goes out of context is that it
is waiting for some part of the TCP handshake to complete.
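For what it's worth, something along the lines of the sketch below should
show how long the connecting thread actually spends off CPU inside
connect(2) ("myclient" is just a placeholder for the real client binary):

#!/usr/sbin/dtrace -s

syscall::connect:entry
/execname == "myclient"/
{
        /* Mark this thread as being inside connect(2). */
        self->in_connect = 1;
}

sched:::off-cpu
/self->in_connect/
{
        /* Remember when the connecting thread is taken off CPU. */
        self->off = timestamp;
}

sched:::on-cpu
/self->off/
{
        /* Distribution of time spent off CPU while inside connect(2). */
        @blocked["ns off-cpu during connect"] = quantize(timestamp - self->off);
        self->off = 0;
}

syscall::connect:return
/self->in_connect/
{
        self->in_connect = 0;
        self->off = 0;
}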
The timestamps are indeed wall-clock timestamps in %Y-%m-%d %H:%M:%S
format, followed by .<wall-clock nanoseconds> and then by :<delta
since the previous line>.
I will follow your advice regarding a trace on the other end, and will
check whether we have some kind of support contract with Sun.
Regards
-------- Original Message --------
Subject: Re: [networking-discuss] connect(2) blocks for more than 250 ms
over loopback
From: Brian Ruthven - Solaris Network Sustaining - Sun UK
<[email protected]>
To: Iwan Aucamp <[email protected]>
Cc: [email protected]
Date: Tue Sep 22 2009 13:45:14 GMT+0200 (SAST)
I've only skimmed through this (actually, only searched for "250"!),
but this looks like an interesting part:
2009-09-21 15:08:06.673356800:+0.000ms ->(fbt:unix:1:resume:entry):()
2009-09-21 15:08:06.673356800:+0.000ms ->(fbt:genunix:1:savectx:entry):()
2009-09-21 15:08:06.673356800:+0.000ms ->(fbt:genunix:1:schedctl_save:entry):()
2009-09-21 15:08:06.673356800:+0.000ms <-(fbt:genunix:1:schedctl_save:return):(2890546872320)
2009-09-21 15:08:06.673356800:+0.000ms <-(fbt:genunix:1:savectx:return):(3298820544480)
2009-09-21 15:08:06.923373568:+250.017ms ->(fbt:genunix:1:restorectx:entry):()
2009-09-21 15:08:06.923373568:+0.000ms ->(fbt:genunix:1:schedctl_restore:entry):()
2009-09-21 15:08:06.923373568:+0.000ms <-(fbt:genunix:1:schedctl_restore:return):(2890546872320)
2009-09-21 15:08:06.923373568:+0.000ms <-(fbt:genunix:1:restorectx:return):(3298820544480)
2009-09-21 15:08:06.923373568:+0.000ms <-(fbt:unix:1:resume:return):(3298820544480)
You don't say what the timestamp values are, but I'm assuming they are
absolute (wall clock) time followed by a delta since the last fbt probe.
If so, then 250ms was spent in resume() between the calls to savectx
and restorectx.
As best I can tell, the stack trace at this point looks something like
this:
resume()
cv_wait_sig_swap_core()
sowaitconnected()
sotpi_connect()
connect()
Although this is a network-related symptom, I suspect your problem is
deeper in the scheduler code.
Having said that, there is a comment above resume (see
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/sun4/ml/swtch.s#45
for the OpenSolaris equivalent) which hints that resume() can spin
under the right conditions. You may need to enhance your dtrace to
include CPU information (to confirm whether this thread stayed on the
same cpu), and may also need to trace what the other app is doing
during the connect() call from the client.
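Something along these lines might be a starting point (the probe list is
just the functions from your trace; the thread pointer and CPU id are what
you would compare across the savectx/restorectx gap):

fbt:unix:resume:entry,
fbt:unix:resume:return,
fbt:genunix:savectx:entry,
fbt:genunix:restorectx:entry
{
        /* Tag each firing with wall-clock time, CPU id and thread pointer. */
        printf("%Y cpu=%d thread=%p %s:%s\n",
            walltimestamp, cpu, curthread, probefunc, probename);
}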
You should continue this with your Solution Centre (through your
Solaris 10 support contract).
Regards,
Brian
Iwan Aucamp wrote:
Hi
On a T2000 node running Solaris 10 u5, connect(2) over loopback to an
application which is completely idle blocks for longer than 250 ms.
The system load average when this occurs is below 2, and another
system (same OS, hardware and software) with much higher CPU load
does have this problem.
In an attempt to figure out what is causing this, I ran dtrace with
fbt:: and syscall:connect:, extracting all fbt:: probes when a connect
fails (the application calling connect times out after 250 ms using
alarm, so connect fails with EINTR).
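The extraction can be done with speculative tracing, roughly along the
lines of the sketch below ("myclient" is a placeholder, and fbt::: is
heavy enough that generous speculation and buffer sizes are needed in
practice):

syscall::connect:entry
/execname == "myclient"/
{
        /* Start a speculation for this connect(2) call. */
        self->spec = speculation();
}

fbt:::
/self->spec/
{
        /* Record all fbt activity for this thread speculatively. */
        speculate(self->spec);
        printf("%Y %s:%s:%s:%s", walltimestamp, probeprov, probemod,
            probefunc, probename);
}

syscall::connect:return
/self->spec && errno == EINTR/
{
        /* Timed-out connect: keep the speculative fbt trace. */
        commit(self->spec);
        self->spec = 0;
}

syscall::connect:return
/self->spec/
{
        /* Successful (or otherwise failed) connect: throw the trace away. */
        discard(self->spec);
        self->spec = 0;
}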
The results of this are attached:
connect.ok.verbose      : successful connect, verbose, for reference
connect.ok              : successful connect, for reference (can be used with diff against connect.timeout)
connect.timeout.verbose : timed-out connect, verbose
connect.timeout         : timed-out connect (can be used with diff against connect.ok)
The major difference starts at:
connect.timeout:
  246: -> (fbt:ip:1:ip_wput_local:entry)
  247: -> (fbt:ip:1:ip_fanout_tcp:entry)
  250: -> (fbt:ip:1:squeue_enter_nodrain:entry)
connect.ok:
  240: -> (fbt:ip:1:ip_wput_local:entry)
  241: -> (fbt:ip:1:ip_fanout_tcp:entry)
  244: -> (fbt:ip:1:squeue_enter_nodrain:entry)
From here
connect.timeout goes to fbt:genunix:timeout and
fbt:genunix:timeout_common.
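One quick way to see which code is arming that timeout would be to
aggregate the kernel stacks that reach timeout_common, along the lines of:

fbt:genunix:timeout_common:entry
{
        /* Count each distinct kernel stack that arms a timeout. */
        @stacks[stack()] = count();
}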
As far as I can figure, a TCP SYN packet is sent over IP but a TCP
SYN-ACK never comes back.
Does this seem like the correct interpretation, and does anybody have
any ideas regarding this?
Regards
------------------------------------------------------------------------
_______________________________________________
networking-discuss mailing list
[email protected]