I'm hung right now as well. My Solaris -> Linux rsync on the switched
100M network ran OK, but a Solaris -> Solaris transfer appears to have
hung. I'm using OpenSSH 2.1.1 on both ends. Dave is running the rsync
daemon and I'm going over ssh, but the hangs look similar.


Sending side
============
Solaris 7, rsync 2.4.6

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
-------------------- -------------------- ----- ------ ----- ------ -------
herc.798             soho.22               8760      0  8760      0 ESTABLISHED


ps report
---------
root 17408 17407  0 22:00:22 ?  0:19 /usr/local/bin/ssh soho rsync --server -vvogDtprz --timeout=3600 --delete --par .......
root 17407 17406  0 22:00:21 ?  1:55 /usr/local/bin/rsync -rptgoD --partial --delete-after -vv --delete -e


# truss -p 17407
poll(0xFFBEFAD0, 0, 1)                          = 0
waitid(P_PID, 17408, 0xFFBEFB00, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEFAD0, 0, 20)                         = 0
poll(0xFFBEFAD0, 0, 1)                          = 0
waitid(P_PID, 17408, 0xFFBEFB00, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEFAD0, 0, 20)                         = 0
waitid(P_PID, 17408, 0xFFBEFB00, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEFAD0, 0, 20)                         = 0


NOTE: the ssh channel is open and appears healthy, and the send/recv
queues are empty on the sending side. Also, the timeout is set to one
hour, but it apparently has not fired: it is now almost 12 hours since
I started this rsync at 22:00. I wonder why it did not time out.


Receiving side
==============
Solaris 7, rsync 2.4.6

   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State
-------------------- -------------------- ----- ------ ----- ------ -------
soho.22              herc.798              8760      0  8760      0 ESTABLISHED

soho% ps -aef |grep rsync
(nothing returned)

NOTE: I see an ssh connection, but no rsync process.


Dave Dykstra wrote:
> 
> On Thu, Oct 19, 2000 at 09:42:02AM +1000, Andrew Tridgell wrote:
> > David,
> >
> > What does "netstat -t" show at both ends when this happens?
> >
> > The reason I keep coming back to this is that this is the most
> > reliable way of telling whether the problem is in rsync or the
> > kernel. It also tells us which end is clagged up.
> 
> Yes, I'm sorry, I knew that was important, but in my haste I was having
> trouble figuring out which netstat option to use; I intended to ask how
> to do that in my message and forgot.  It turns out that on Solaris the
> better option is "-P tcp", but actually no options are needed on either
> side; those options just reduce the clutter.
> 
> > If one end shows that it has data in the sendq and it is not moving
> > (see if it changes in size) and the other end shows no data in the
> > recvq, then you know that it must be a kernel bug.  We have
> > encountered that a few times here with a network card that drops a
> > high percentage of packets.  (Dropping packets should not cause this
> > with TCP, but it seems to trigger a Linux TCP bug.)
> 
> The linux server says
> 
>     Proto Recv-Q Send-Q Local Address           Foreign Address         State
>     tcp    35230  40686 expbuild.research.:8664 dynamic.ih.lucent:36352 ESTABLISHED
> 
> and the solaris client says
> 
>    Local Address               Remote Address                       Swind Send-Q Rwind Recv-Q  State
>    dynamic.ih.lucent.com.36352 expbuild.research.bell-labs.com.8664     0   1459  8760      0 ESTABLISHED
> 
> so there is data in the Send-Q on the server and no data in the Recv-Q on
> the client.  The processes on the client side according to truss were both
> in "poll" with a timeout of 60 seconds.
> 
> I very much doubt a bad network card, however.  I had many pairs of
> machines timing out on the same set of files the last couple of nights
> (I run a hierarchical distribution system; in general, the first level
> sends across the WAN to many different geographical locations and the
> next level sends to other machines on LANs), including Solaris-Solaris
> transfers.
> 
> This morning I observed that while one client process was working hard
> for a long time, the other one was indeed idle a lot of the time, so I
> am again leaning toward the necessity of Neil Schellenberger's timeout
> fix.  The above test was run with --timeout 0.  Remarkably, after it was
> hung for a long time it finally did exit without any error messages.
> This reminds me of another question/(patch?) somebody posted about hangs
> right at the end of a run.  I looked through the subjects in the mailing
> list archives for the last three months and it didn't jump out at me;
> can anybody help me out?
> 
> - Dave Dykstra
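
One more thought on the timeout question above: for --timeout to catch a
hang like this, every blocking read presumably has to wake up
periodically and compare the time of the last traffic against the
deadline. Below is a hedged C sketch of that select()-with-deadline
pattern; the names io_timeout, last_io, and read_with_timeout are mine
for illustration (I have not read Neil's patch, so this shows the
general technique, not his code):

#include <sys/types.h>
#include <sys/select.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static time_t last_io;         /* updated whenever data moves */
static int io_timeout = 3600;  /* seconds, from --timeout */

/* Give up if nothing has moved on the socket for io_timeout seconds. */
static void check_timeout(void)
{
    if (io_timeout && time(NULL) - last_io >= io_timeout) {
        fprintf(stderr, "io timeout after %d seconds - exiting\n",
                io_timeout);
        exit(1);
    }
}

/* Read from fd, but wake up every 10 seconds to re-check the
 * deadline instead of blocking forever inside read(). */
ssize_t read_with_timeout(int fd, char *buf, size_t len)
{
    fd_set fds;
    struct timeval tv;

    if (!last_io)              /* first call: start the clock */
        last_io = time(NULL);

    for (;;) {
        FD_ZERO(&fds);
        FD_SET(fd, &fds);
        tv.tv_sec = 10;
        tv.tv_usec = 0;

        if (select(fd + 1, &fds, NULL, NULL, &tv) > 0) {
            last_io = time(NULL);
            return read(fd, buf, len);
        }
        check_timeout();       /* no traffic this pass */
    }
}

The failure mode in 2.4.6 would then be any code path, such as the
child-reaping loop I sketched above, that blocks or spins without ever
reaching a timeout check like this one.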

-- 
__________________________________________________________________
Eric T. Whiting                                 AMI Semiconductors   
(208) 234-6717                                  2300 Buckskin Road
(208) 234-6659 (fax)                            Pocatello, ID 83201
[EMAIL PROTECTED]
