On Thu, Oct 19, 2000 at 09:42:02AM +1000, Andrew Tridgell wrote:
> David,
>
> What does "netstat -t" show at both ends when this happens?
>
> The reason I keep coming back to this is that this is the most
> reliable way of telling whether the problem is in rsync or the
> kernel. It also tells us which end is clagged up.
Yes, I'm sorry, I knew that was important, but in my haste I was having
trouble figuring out which netstat option to use; I meant to ask how to do
that in my message and forgot. It turns out that on Solaris the better
option is "-P tcp", but actually no options are needed on either side;
those options just reduce the clutter.
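For the record, something along these lines is enough on each side (the -n
and the grep are just my own additions to cut the clutter; any filter, or
none at all, works):

    # on the Linux server
    netstat -t -n | grep 36352
    # on the Solaris client
    netstat -n -P tcp | grep 36352

Re-running it a few times shows whether the queues are actually moving.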
> If one end shows that it has data in the sendq and it is not moving
> (see if it changes in size) and the other end shows no data in the
> recvq then you know that it must be a kernel bug. We have encountered
> that a few times here with a network card that drops a high
> percentage of packets. (Dropping packets should not cause this with
> tcp, but it seems to trigger a Linux tcp bug.)
The Linux server says
Proto Recv-Q Send-Q Local Address             Foreign Address           State
tcp    35230  40686 expbuild.research.:8664   dynamic.ih.lucent:36352   ESTABLISHED
and the Solaris client says
  Local Address               Remote Address                         Swind Send-Q Rwind Recv-Q  State
dynamic.ih.lucent.com.36352   expbuild.research.bell-labs.com.8664       0   1459  8760      0  ESTABLISHED
so there is data in the Send-Q on the server and no data in the Recv-Q on
the client. According to truss, the processes on the client side were both
sitting in "poll" with a 60-second timeout.
I very much doubt a bad network card, however. I had many pairs of machines
timing out on the same set of files over the last couple of nights, including
Solaris-to-Solaris transfers. (I run a hierarchical distribution system; in
general, the first level sends across the WAN to many different geographical
locations and the next level sends to other machines on LANs.)
This morning I observed that while one client process was working hard for
a long time, the other one was indeed idle a lot of the time, so I am again
leaning toward the necessity of Neil Schellenberger's timeout fix. The
above test was run with --timeout 0. Remarkably, after it had been hung for
a long time it finally did exit without any error messages. This reminds me
of another question (and patch?) somebody posted about hangs right at the
end of a run. I looked through the subjects in the mailing list archives for
the last three months and it didn't jump out at me; can anybody help me out?
- Dave Dykstra