On Wed, Apr 13, 2016 at 06:19:41PM +0200, Michal Hocko wrote:
> Package: rsync
> Version: 3.1.1-3
> Severity: normal
> 
> Hi,
> I have hit a strange lockup when rsyncing to an nfs storage mounted
> localy. At first I though it would be a network related problem but
> further investigation indicates otherwise.
> 
> rsync -avz SOURCE_DIR TARGET_DIR spawns 2 children processes
> 31043 pts/2    S+     0:25           rsync -azv --progress SOURCE_DIR 
> TARGET_DIR
> 31044 pts/2    S+     0:03             rsync -azv --progress SOURCE_DIR 
> TARGET_DIR
> 31045 pts/2    S+     0:03               rsync -azv --progress SOURCE_DIR 
> TARGET_DIR
> 
> I have tried to strace all three (they run under root) as root.
> # strace -tt -T -p 31043
> strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
> # strace -tt -T -p 31044
> Process 31044 attached
> 11:48:30.309529 select(2, [], [1], [], {29, 878783}) = 0 (Timeout) <29.908793>
> 11:49:00.218621 select(2, [], [1], [], {60, 0}) = 0 (Timeout) <60.039005>
> 11:50:00.257837 select(2, [], [1], [], {60, 0}) = 0 (Timeout) <60.043860>
> 
> So it is waiting to write to fd 1 which is
> # ls -l /proc/31044/fd/1
> lrwx------ 1 root root 64 Apr 13 11:48 /proc/31044/fd/1 -> socket:[151908]
> # lsof -p 31044 | grep -w 151908
> rsync   31044 root    1u  unix 0xffff8801bea58b80      0t0  151908 socket
> # ss -nxp | grep 151908
> u_str  ESTAB      0      68951                  * 151908                 * 
> 151907 users:(("rsync",pid=31044,fd=1))
> u_str  ESTAB      64049  0                      * 151907                 * 
> 151908 users:(("rsync",pid=31043,fd=5))
> 
> So it is a unix socket (most probably socketpair) to the first rsync
> task.  I have only briefly checked the code and it seems to be
> fd_pair(from_child_pipe) because it matches stdout fd.
> 
> The other task is in a similar state
> # strace -tt -T -p 31045
> Process 31045 attached
> 17:50:33.740635 select(8, [], [7], [], {52, 42706}) = 0 (Timeout) <52.094874>
> 17:51:25.835829 select(8, [], [7], [], {60, 0}) = 0 (Timeout) <60.060079>
> 17:52:25.895988 select(8, [], [7], [], {60, 0}^CProcess 31045 detached
>  <detached ...>
> # ls -l /proc/31045/fd/7
> lrwx------ 1 root root 64 Apr 13 12:00 /proc/31045/fd/7 -> socket:[151917]
> # lsof -p 31045 | grep 151917
> rsync   31045 root    7u  unix 0xffff88041587fbc0      0t0  151917 socket
> # ss -nxp | grep 151917
> u_str  ESTAB      13052  0                      * 151916                 * 
> 151917 users:(("rsync",pid=31044,fd=6))
> u_str  ESTAB      0      54272                  * 151917                 * 
> 151916 users:(("rsync",pid=31045,fd=7))
> 
> except 31044 is on the other end which cannot read from the socket
> buffer because it is stuck trying to write to its parent (safe_write
> seems to be retrying endlessly). So it all looks like the first tasks
> is blocking the remaining two. I wish I could find out more about the
> primary process but all I can see is

OK, found the reason I cannot strace the first task (another admin has
left his strace running...). Anyway this looks like a deadlock.
# strace -p 31043
Process 31043 attached
select(5, [], [4], [], {31, 179565})    = 0 (Timeout)
select(5, [], [4], [], {60, 0})         = 0 (Timeout)
# ls -l /proc/31043/fd/4
lrwx------ 1 root root 64 Apr 14 09:35 /proc/31043/fd/4 -> socket:[151906]
# lsof -p 31043 | grep 151906
rsync   31043 root    4u  unix 0xffff8801a41a9b80      0t0   151906 socket
# ss -nxp | grep 151906
u_str  ESTAB      18465  0                      * 151905                 * 
151906 users:(("rsync",pid=31045,fd=0))
u_str  ESTAB      0      53760                  * 151906                 * 
151905 users:(("rsync",pid=31043,fd=4))

To summarize:
31043 wants to write to a socket with 31045 on the other end but the socket
buffer is most probably full, because select would tell us otherwise.
31045 wants to write to a socket with 31044 on the other end and cannot
proceed as well.
31044 wants to write to a socket with 31043 on the other end and we are
in the waiting loop. So this looks like a deadlock unless I have screwed
something in the above.
-- 
Michal Hocko

Reply via email to