I'm hung right now as well. My solaris -> linux rsync on the switched
100M network ran ok, but it seems to have hung on a solaris -> solaris
transfer. I'm using OpenSSH_2.1.1 on both ends. Dave is running the
rsync daemon and I'm going over ssh, but the hangs look similar.
SENDING SIDE
============
Solaris7 Rsync 2.4.6
Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -------
herc.798             soho.22               8760      0  8760      0 ESTABLISHED
ps report
-----------
root 17408 17407  0 22:00:22 ?  0:19 /usr/local/bin/ssh soho rsync --server -vvogDtprz --timeout=3600 --delete --par .......
root 17407 17406  0 22:00:21 ?  1:55 /usr/local/bin/rsync -rptgoD --partial --delete-after -vv --delete -e
# truss -p 17407
poll(0xFFBEFAD0, 0, 1) = 0
waitid(P_PID, 17408, 0xFFBEFB00, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEFAD0, 0, 20) = 0
poll(0xFFBEFAD0, 0, 1) = 0
waitid(P_PID, 17408, 0xFFBEFB00, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEFAD0, 0, 20) = 0
waitid(P_PID, 17408, 0xFFBEFB00, WEXITED|WTRAPPED|WNOHANG) = 0
poll(0xFFBEFAD0, 0, 20) = 0
NOTE: the ssh channel is open and appears healthy. The send/recv queues
are empty on the sending side. Also, the timeout is 1 hour
(--timeout=3600), but rsync has apparently not timed out: it is almost
12 hours since 22:00, when I started this rsync. I wonder why it did
not time out.
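For what it's worth, the truss output above looks like a non-blocking
wait loop: a 20 ms sleep via poll(), then waitid() with WNOHANG on the
ssh child (17408), which keeps returning 0 because the child has not
exited. A minimal Python sketch of that pattern follows. It is only an
illustration of what the trace suggests, not rsync's actual source; the
name wait_for_child and the max_wait parameter are made up for the
example.

```python
import os
import subprocess
import sys
import time


def wait_for_child(pid, interval=0.02, max_wait=None):
    """Poll for a child's exit without blocking (hypothetical helper).

    Mirrors the shape of the trace: check with WNOHANG, sleep ~20 ms,
    repeat. Returns the exit code, a negative signal number if the child
    was killed, or None if max_wait seconds pass first. max_wait is an
    assumption added for the sketch; the trace shows no such bound.
    """
    start = time.monotonic()
    while True:
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid != 0:
            # Child has exited; decode its wait status.
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return -os.WTERMSIG(status)
        if max_wait is not None and time.monotonic() - start > max_wait:
            return None  # still running: the state seen in the truss output
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for the ssh child: a process that sleeps briefly, then exits.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
    print("child exit code:", wait_for_child(child.pid))
```

If rsync is sitting in a loop like this, it is waiting for ssh to go
away rather than doing any I/O. That might be related to why the
--timeout never fires, but that is a guess on my part.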
RECEIVING SIDE
==============
Solaris7 Rsync 2.4.6
Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -------
soho.22              herc.798              8760      0  8760      0 ESTABLISHED
soho% ps -aef |grep rsync
(nothing returned)
NOTE: I see an ssh connection, but no rsync process.
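To make the comparison with Dave's case concrete, here is a rough Python
sketch of the netstat rule Andrew describes in the quoted message below:
data sitting unchanged in one end's Send-Q while the other end's Recv-Q
is empty points at the kernel or network. The function name, the labels
it returns, and the "application-not-reading" branch are my own
assumptions for illustration, and it expects at least two readings per
end taken some time apart.

```python
def classify_hang(send_q_samples, peer_recv_q_samples):
    """Classify a stalled TCP connection from repeated netstat readings.

    send_q_samples:      Send-Q values read on one end over time
    peer_recv_q_samples: Recv-Q values read on the other end over time
    The returned labels are illustrative only, not netstat or rsync output.
    """
    def unchanged(samples):
        # "Not moving" needs at least two readings that are all equal.
        return len(samples) >= 2 and len(set(samples)) == 1

    stuck_send = unchanged(send_q_samples) and send_q_samples[0] > 0
    peer_empty = all(q == 0 for q in peer_recv_q_samples)

    if stuck_send and peer_empty:
        # Data is queued to go out but never reaches the peer's socket.
        return "kernel-or-network"
    if unchanged(peer_recv_q_samples) and peer_recv_q_samples[0] > 0:
        # Assumption: data has arrived but the process is not reading it.
        return "application-not-reading"
    if peer_empty and all(q == 0 for q in send_q_samples):
        # Nothing queued anywhere: both ends are simply idle.
        return "idle"
    return "inconclusive"


if __name__ == "__main__":
    # My readings above: queues empty at both ends, on two hypothetical looks.
    print(classify_hang([0, 0], [0, 0]))
```

With my numbers (everything zero) this comes out as "idle", which is
different from Dave's case, where the server's Send-Q held data.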
Dave Dykstra wrote:
>
> On Thu, Oct 19, 2000 at 09:42:02AM +1000, Andrew Tridgell wrote:
> > David,
> >
> > What does "netstat -t" show at both ends when this happens?
> >
> > The reason I keep coming back to this is that this is the most
> > reliable way of telling whether the problem is in rsync or the
> > kernel. It also tells us which end is clagged up.
>
> Yes, I'm sorry, I knew that was important but in my haste I was having
> trouble figuring out which netstat option to use and I intended to
> ask how to do that in my message and forgot. Turns out that on solaris
> the better option is "-P tcp" but actually no options are needed on
> either side; those options just reduce the clutter.
>
> > If one end shows that it has data in the sendq and it is not moving
> > (see if it changes in size) and the other end shows no data in the
> > recvq then you know that it must be a kernel bug. We have encountered
> > that a few times here with a network card that drops a high
> > percentage of packets. (Dropping packets should not cause this with
> > TCP, but it seems to trigger a Linux TCP bug.)
>
> The linux server says
>
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 35230 40686 expbuild.research.:8664 dynamic.ih.lucent:36352 ESTABLISHED
>
> and the solaris client says
>
> Local Address               Remote Address                       Swind Send-Q Rwind Recv-Q State
> dynamic.ih.lucent.com.36352 expbuild.research.bell-labs.com.8664     0   1459  8760      0 ESTABLISHED
>
> so there is data in the Send-Q on the server and no data in the Recv-Q on
> the client. The processes on the client side according to truss were both
> in "poll" with a timeout of 60 seconds.
>
> I very highly doubt a bad network card, however. I had many pairs
> of machines timing out on the same set of files the last couple nights (I
> run a hierarchical distribution system; in general, the first level sends
> across the WAN to many different geographical locations and the next level
> sends to other machines on LANs), including solaris-solaris transfers.
>
> This morning I observed that while one client process was working hard for
> a long time, the other one was indeed idle a lot of the time so I am again
> leaning toward the necessity of Neil Schellenberger's timeout fix. The
> above test was run with --timeout 0. Remarkably, after it was hung for a
> long time it finally did exit without any error messages. This reminds me
> of another question/(patch?) somebody posted about hangs right at the end
> of a run. I looked through the subjects in the mailing list archives for
> the last three months and it didn't jump out at me; can anybody help me out?
>
> - Dave Dykstra
--
__________________________________________________________________
Eric T. Whiting AMI Semiconductors
(208) 234-6717 2300 Buckskin Road
(208) 234-6659 (fax) Pocatello,ID 83201
[EMAIL PROTECTED]