On Fri, Oct 20, 2000 at 08:43:59AM +1000, Andrew Tridgell wrote:
> > The linux server says
> > 
> >     Proto Recv-Q Send-Q Local Address           Foreign Address         State 
> >     tcp    35230  40686 expbuild.research.:8664 dynamic.ih.lucent:36352 ESTABLISHED
> > 
> > and the solaris client says
> > 
> >    Local Address               Remote Address                       Swind Send-Q Rwind Recv-Q  State
> >    dynamic.ih.lucent.com.36352 expbuild.research.bell-labs.com.8664     0   1459  8760      0 ESTABLISHED
> 
> ok, if the above condition is not temporary (ie. not just packet loss)
> and the cable between the two boxes is OK then this is _definitely_ an
> OS bug. The job of TCP is to get data from the sendq on one side to
> the recvq on the other. The only reason that data would not be sent on
> an ESTABLISHED connection is if the window was zero, and you don't get
> that with a zero sized recvq.
> 
> It is quite impossible for rsync to cause the above condition. The
> rsync server has written some data to a socket in the expectation that
> it will get to the other end (that's what reliable transports are all
> about), but the data hasn't got there.
> 
> The next thing you have to do is run a sniffer to determine whether it
> is a Solaris or Linux bug. My bet is this will be the same Linux bug
> we have observed here. You'll see the Linux box sending data outside
> the window that the Solaris box is offering, the Solaris box will
> reject that data by sending an ack with the current window and the
> Linux box will ignore the hint.

...

On Fri, Oct 20, 2000 at 10:44:04AM +1000, Andrew Tridgell wrote:
...
> Stephen tells me that a patch went into the 2.2.17 Linux kernel that
> was supposed to fix this particular problem. If you get a chance it
> would be worth trying that kernel (or a 2.2.18preX) to see if it
> solves your problem.

We upgraded to 2.2.17 but it didn't help, unfortunately.  An SGI machine in
the same machine room doesn't exhibit the problem with --timeout 0, though,
so you're probably right.  I didn't get a chance to do any sniffing but
I'll try doing that later I guess.
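
For whoever ends up reading the trace: the question the sniffer has to
answer is whether a data segment from the Linux box ends past the right
edge of the window that the Solaris box last advertised (its ack number
plus its window). A rough sketch of that check follows. The function
names and the sequence numbers are made up for illustration; only the
8760-byte window comes from the netstat output above. It also ignores
window scaling, so treat it as an aid for eyeballing a trace, not as a
description of what either kernel does.

```python
MOD = 1 << 32  # TCP sequence numbers are 32 bits and wrap around


def signed_diff(a, b):
    """Return a - b in 32-bit sequence space, as a signed distance."""
    d = (a - b) % MOD
    return d - MOD if d >= MOD // 2 else d


def beyond_window(seq, length, ack, window):
    """True if the segment [seq, seq+length) ends past ack + window.

    seq, length: sequence number and payload size of the data segment
                 seen from the sender (hypothetical values in a trace).
    ack, window: the receiver's most recent ack number and advertised
                 window, taken from the packet just before it.
    """
    return signed_diff(seq + length, ack + window) > 0


if __name__ == "__main__":
    # Illustrative numbers only: receiver has acked up to 1000 and
    # offers 8760 bytes, so the right edge is 9760.
    print(beyond_window(1000, 1460, 1000, 8760))  # fits inside the window
    print(beyond_window(9000, 1460, 1000, 8760))  # runs past the edge
```

If segments like the second one keep showing up after the Solaris box
has re-acked with its current window, that would match the behavior
Andrew describes.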


...
> > This morning I observed that while one client process was working hard for
> > a long time, the other one was indeed idle a lot of the time so I am again
> > leaning toward the necessity of Neil Schellenberger's timeout fix.  The
> > above test was run with --timeout 0.
> 
> Neil's analysis is quite plausible and worth looking into but that is
> most definitely not what is causing the hang you see here.

I'm thinking that this is the likely cause of most of my timeouts with
large updates, other than the Linux freeze-up.  I will try to do more
testing, and I think it would be very helpful if you could scrutinize his
patch to see if you think it is a good solution.

Thanks,

- Dave Dykstra
