On Thu, Oct 19, 2000 at 09:42:02AM +1000, Andrew Tridgell wrote:
> David,
> 
> What does "netstat -t" show at both ends when this happens?
> 
> The reason I keep coming back to this is that this is the most
> reliable way of telling whether the problem is in rsync or the
> kernel. It also tells us which end is clagged up.

Yes, I'm sorry; I knew that was important, but in my haste I had trouble
figuring out which netstat option to use, and I meant to ask about it in
my message and forgot.  It turns out that on Solaris the better option is
"-P tcp", but actually no options are needed on either side; those
options just reduce the clutter.
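
For anyone trying this at home, the two invocations are simply:

    # Linux end: limit the listing to TCP sockets
    netstat -t

    # Solaris end: limit the report to the TCP section
    netstat -P tcp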

> If one end shows that it has data in the sendq and it is not moving
> (see if it changes in size) and the other end shows no data in the
> recvq then you know that it must be a kernel bug. We have encountered
> that a few times here with a network card that drops a high
> percentage of packets. (dropping packets should not cause this with
> tcp, but it seems to trigger a Linux tcp bug).

The Linux server says

    Proto Recv-Q Send-Q Local Address           Foreign Address         State 
    tcp    35230  40686 expbuild.research.:8664 dynamic.ih.lucent:36352 ESTABLISHED

and the Solaris client says

   Local Address               Remote Address                       Swind Send-Q Rwind Recv-Q State
   dynamic.ih.lucent.com.36352 expbuild.research.bell-labs.com.8664     0   1459  8760      0 ESTABLISHED

so there is data in the Send-Q on the server and no data in the Recv-Q on 
the client.  According to truss, the processes on the client side were
both in "poll" with a timeout of 60 seconds.

I very much doubt a bad network card, however.  I had many pairs of
machines timing out on the same set of files over the last couple of
nights (I run a hierarchical distribution system; in general, the first
level sends across the WAN to many different geographical locations and
the next level sends to other machines on LANs), including
Solaris-to-Solaris transfers.

This morning I observed that while one client process was working hard
for a long time, the other one was indeed idle a lot of the time, so I
am again leaning toward the necessity of Neil Schellenberger's timeout
fix.  The above test was run with --timeout 0.
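
For anyone unfamiliar with the option: --timeout 0 disables rsync's I/O
timeout entirely, while a nonzero value makes rsync give up after that
many seconds with no data moving.  A hypothetical invocation (the paths
and the 600-second limit are made up for illustration):

    # abort after 600 seconds of inactivity instead of hanging forever
    rsync -av --timeout=600 /src/tree/ remotehost:/dest/tree/
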
Remarkably, after being hung for a long time it finally did exit without
any error messages.  This reminds me of another question/(patch?)
somebody posted about hangs right at the end of a run.  I looked through
the subjects in the mailing list archives for the last three months and
it didn't jump out at me; can anybody help me out?

- Dave Dykstra
