Quick follow-up to my previous posting today.

I'm now using the -W flag on 2.4.6 again and it just hung.  Luckily a
'kill -HUP' to the process it's waiting on will do the trick in terms
of killing rsync cleanly.  But it also means I have to sit and watch
the damn thing do its work and wait until the CPU usage drops to 1%
or so.

Here's a process listing:

    atlas:/newtoast192/users# ps -ef | grep rsync
        root  8809  8808  0 14:42:07 pts/2    0:12 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh
        root  8808  8674  0 14:42:07 pts/2    5:01 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh
        root  8819  8809  0 14:43:30 pts/2   14:57 /home/stoffel/src/Tools/rsync-2.4.6/rsync --archive --delete --exclude ".snapsh


So I do a truss on one pid and I see....

    atlas:/newtoast192/users# truss -p 8809
    poll(0xEFFFB6E0, 1, 60000)      (sleeping...)
    ^C

Oops, wrong pid.  I then check the previous one and I see:

    atlas:/newtoast192/users# truss -p 8808
    poll(0xEFFFD3B0, 0, 20)                         = 0
    waitid(P_PID, 8809, 0xEFFFF3B8, WEXITED|WTRAPPED|WNOHANG) = 0
    poll(0xEFFFD3B0, 0, 20)                         = 0
    waitid(P_PID, 8809, 0xEFFFF3B8, WEXITED|WTRAPPED|WNOHANG) = 0
    poll(0xEFFFD3B0, 0, 20)                         = 0
    poll(0xEFFFD3B0, 0, 7)                          = 0
    waitid(P_PID, 8809, 0xEFFFF3B8, WEXITED|WTRAPPED|WNOHANG) = 0
    poll(0xEFFFD3B0, 0, 20)                         = 0
    poll(0xEFFFD3B0, 0, 1)                          = 0


And it just sits there in waitid() and poll().  So it really looks
like the real problem is in the child process, pid=8809, because it's
just sitting there doing a poll() on just one file descriptor and
waiting for info.  I don't think it ever gets anything, so maybe it's a
subtle protocol error, where it's not getting a shutdown message from
the other end, or it doesn't have a good heuristic that says:

    if I don't get *any* info after X seconds, just die

where X would be something like 900 or 1200 seconds, which seems like
a reasonable number.  Now this of course would only kick in after data
has been transferring for a while, since the initial work on a large
directory could take quite a while.
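
To make the idea concrete, here's a rough sketch of that heuristic.  It's
in Python rather than rsync's C, and it is not rsync's actual code: the
names read_with_idle_timeout, IdleTimeout and IDLE_TIMEOUT are made up for
illustration.  It waits with no limit until the first data shows up (to
allow for the slow initial work on a big directory), and only after that
gives up if the descriptor stays silent for X seconds.

```python
import os
import select

IDLE_TIMEOUT = 900  # seconds; the "X" from the heuristic (an assumed value)


class IdleTimeout(Exception):
    """Raised when the descriptor has been silent for too long."""


def read_with_idle_timeout(fd, idle_timeout=IDLE_TIMEOUT, chunk=4096):
    """Yield data read from fd; raise IdleTimeout once it goes quiet.

    The timeout is only armed after the first chunk of data arrives,
    so a long initial scan on the other end does not trigger it.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    timeout_ms = None  # None = block indefinitely until the first data
    while True:
        events = poller.poll(timeout_ms)
        if not events:
            # poll() returned with nothing ready: idle limit exceeded
            raise IdleTimeout("no data for %s seconds" % idle_timeout)
        data = os.read(fd, chunk)
        if not data:
            return  # EOF: the other end closed the descriptor normally
        yield data
        # Data has started flowing, so arm the idle timeout (milliseconds)
        timeout_ms = int(idle_timeout * 1000)
```

A caller would loop over read_with_idle_timeout(fd) and treat IdleTimeout
as "the other end is stuck, clean up and exit" instead of polling forever.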

This is really starting to bug me, so I'd like to try and get this
solved if we could.  Unfortunately, I'm not a great programmer, I'm
more of a tweaker/hacker type.

Thanks,
John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
         [EMAIL PROTECTED] - http://www.lucent.com - 978-952-7548
