On Wed, Apr 13, 2016 at 06:19:41PM +0200, Michal Hocko wrote:
> Package: rsync
> Version: 3.1.1-3
> Severity: normal
>
> Hi,
> I have hit a strange lockup when rsyncing to an nfs storage mounted
> localy. At first I though it would be a network related problem but
> further investigation indicates otherwise.
>
> rsync -avz SOURCE_DIR TARGET_DIR spawns 2 children processes
> 31043 pts/2 S+ 0:25 rsync -azv --progress SOURCE_DIR
> TARGET_DIR
> 31044 pts/2 S+ 0:03 rsync -azv --progress SOURCE_DIR
> TARGET_DIR
> 31045 pts/2 S+ 0:03 rsync -azv --progress SOURCE_DIR
> TARGET_DIR
>
> I have tried to strace all three (they run under root) as root.
> # strace -tt -T -p 31043
> strace: attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
> # strace -tt -T -p 31044
> Process 31044 attached
> 11:48:30.309529 select(2, [], [1], [], {29, 878783}) = 0 (Timeout) <29.908793>
> 11:49:00.218621 select(2, [], [1], [], {60, 0}) = 0 (Timeout) <60.039005>
> 11:50:00.257837 select(2, [], [1], [], {60, 0}) = 0 (Timeout) <60.043860>
>
> So it is waiting to write to fd 1 which is
> # ls -l /proc/31044/fd/1
> lrwx------ 1 root root 64 Apr 13 11:48 /proc/31044/fd/1 -> socket:[151908]
> # lsof -p 31044 | grep -w 151908
> rsync 31044 root 1u unix 0xffff8801bea58b80 0t0 151908 socket
> # ss -nxp | grep 151908
> u_str ESTAB 0 68951 * 151908 *
> 151907 users:(("rsync",pid=31044,fd=1))
> u_str ESTAB 64049 0 * 151907 *
> 151908 users:(("rsync",pid=31043,fd=5))
>
> So it is a unix socket (most probably socketpair) to the first rsync
> task. I have only briefly checked the code and it seems to be
> fd_pair(from_child_pipe) because it matches stdout fd.
>
> The other task is in a similar state
> # strace -tt -T -p 31045
> Process 31045 attached
> 17:50:33.740635 select(8, [], [7], [], {52, 42706}) = 0 (Timeout) <52.094874>
> 17:51:25.835829 select(8, [], [7], [], {60, 0}) = 0 (Timeout) <60.060079>
> 17:52:25.895988 select(8, [], [7], [], {60, 0}^CProcess 31045 detached
> <detached ...>
> # ls -l /proc/31045/fd/7
> lrwx------ 1 root root 64 Apr 13 12:00 /proc/31045/fd/7 -> socket:[151917]
> # lsof -p 31045 | grep 151917
> rsync 31045 root 7u unix 0xffff88041587fbc0 0t0 151917 socket
> # ss -nxp | grep 151917
> u_str ESTAB 13052 0 * 151916 *
> 151917 users:(("rsync",pid=31044,fd=6))
> u_str ESTAB 0 54272 * 151917 *
> 151916 users:(("rsync",pid=31045,fd=7))
>
> except 31044 is on the other end which cannot read from the socket
> buffer because it is stuck trying to write to its parent (safe_write
> seems to be retrying endlessly). So it all looks like the first tasks
> is blocking the remaining two. I wish I could find out more about the
> primary process but all I can see is
OK, found the reason I cannot strace the first task (another admin has
left his strace running...). Anyway this looks like a deadlock.
# strace -p 31043
Process 31043 attached
select(5, [], [4], [], {31, 179565}) = 0 (Timeout)
select(5, [], [4], [], {60, 0}) = 0 (Timeout)
# ls -l /proc/31043/fd/4
lrwx------ 1 root root 64 Apr 14 09:35 /proc/31043/fd/4 -> socket:[151906]
# lsof -p 31043 | grep 151906
rsync 31043 root 4u unix 0xffff8801a41a9b80 0t0 151906 socket
# ss -nxp | grep 151906
u_str ESTAB 18465 0 * 151905 *
151906 users:(("rsync",pid=31045,fd=0))
u_str ESTAB 0 53760 * 151906 *
151905 users:(("rsync",pid=31043,fd=4))
To summarize:
31043 wants to write to a socket with 31045 on the other end but the socket
buffer is most probably full, because select would tell us otherwise.
31045 wants to write to a socket with 31044 on the other end and cannot
proceed as well.
31044 wants to write to a socket with 31043 on the other end and we are
in the waiting loop. So this looks like a deadlock unless I have screwed
something in the above.
--
Michal Hocko