Hi, We're encountering a rather weird, sporadic hang during a large-file transfer. A bit of background on the setup:
- Host A rsyncs a MySQL database from Host B. Both sides use rsync 3.1.1 on Debian Jessie. - Host B serves the files using rsync --daemon from an ext4 filesystem residing on a thin LVM snapshot. - Host A uses the following command to rsync the files to a btrfs subvolume: rsync -og -avPH --log-file=/var/log/mysql-rsync.log --debug=ALL --chown=user:user --delete --inplace rsync://hostb.example.com/mysql/ /media/btrfs/db-snapshot-$$/mysql (I know parts of the command (-og + -a + --chown) don't make much sense :) - Both hosts sit on the same LAN, connected to a Juniper EX4200 stack with active-backup interfaces. MTU is standard 1500 and there are no known network issues. Host A has Intel I350 (igb) NICs, while Host B has BCM5709 (bnx2) NICs. Both hosts run Linux 3.16.36. Once every few (~3-5) transfers, the rsync process hangs with both sides waiting on select() on the TCP socket. From the transfer logs I can see that the hang always happens when about to receive the same 32GB file, already a couple of GBs into the total transfer. Host A reports: .... 04:41:03 [6281] [receiver] send_msg_int(100, 1701) 04:41:03 [6281] recv_files(mysql/innodb_table_stats.ibd) 04:41:03 [6281] sender finished //mysql/innodb_table_stats.ibd 04:41:03 [6281] send_files(1749, //mysql/user.MYD) 04:41:03 [6281] sender finished //mysql/user.MYD 04:41:03 [6281] send_files(1750, //mysql/user.MYI) 04:41:03 [6281] sender finished //mysql/user.MYI 04:41:03 [6281] send_files(1806, //some_dbase) 04:41:03 [6281] send_files(1808, //some_dbase/actions.ibd) 04:41:03 [6281] >f.st...... mysql/innodb_table_stats.ibd 04:41:03 [6281] finishing mysql/innodb_table_stats.ibd 04:41:03 [6281] set modtime of mysql/innodb_table_stats.ibd to (1491343251) Wed Apr 5 01:00:51 2017 04:41:03 [6281] [receiver] send_msg_int(100, 1703) 04:41:03 [6281] recv_files(mysql/user.MYD) 04:41:03 [6281] >f.st...... mysql/user.MYD 04:41:03 [6281] finishing mysql/user.MYD 04:41:03 [6281] set modtime of mysql/user.MYD to (1487686529) Tue Feb 21 16:15:29 2017 04:41:03 [6281] [receiver] send_msg_int(100, 1749) 04:41:03 [6281] recv_files(mysql/user.MYI) 04:41:03 [6281] >f..t...... mysql/user.MYI 04:41:03 [6281] finishing mysql/user.MYI 04:41:03 [6281] set modtime of mysql/user.MYI to (1487689840) Tue Feb 21 17:10:40 2017 04:41:03 [6281] [receiver] send_msg_int(100, 1750) 04:41:03 [6281] recv_files(some_dbase) 04:41:03 [6281] .d..t...... some_dbase/ <silence> while Host B reports: .... 01:41:03 [3726] send_files(1749, //mysql/user.MYD) 01:41:03 [3726] <f.st...... mysql/user.MYD 01:41:03 [3726] sender finished //mysql/user.MYD 01:41:03 [3726] send_files(1750, //mysql/user.MYI) 01:41:03 [3726] <f..t...... mysql/user.MYI 01:41:03 [3726] sender finished //mysql/user.MYI 01:41:03 [3726] send_files(1806, //some_dbase) 01:41:03 [3726] .d..t...... some_dbase/ 01:41:03 [3726] send_files(1808, //some_dbase/actions.ibd) <silence> The backtraces of the hung processes are: Host A (receiver): #0 0x00007f87d7da3873 in select () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f87d88dd7a2 in perform_io (needed=needed@entry=4, flags=flags@entry=17) at io.c:742 #2 0x00007f87d88deef7 in raw_read_int () at io.c:908 #3 0x00007f87d88df683 in read_a_msg () at io.c:1423 #4 0x00007f87d88e0676 in read_buf (f=f@entry=4, buf=buf@entry=0x7ffd814c0df0 "\300\037L\201\375\177", len=len@entry=1) at io.c:1835 #5 0x00007f87d88e157b in read_ndx (f=f@entry=4) at io.c:2222 #6 0x00007f87d88be377 in read_ndx_and_attrs (f_in=4, f_out=6, iflag_ptr=0x7ffd814c0f28, type_ptr=0x7ffd814c0f27 "\200\b", buf=0x7ffd814c1fc0 "", len_ptr=0x7ffd814c0f2c) at rsync.c:320 #7 0x00007f87d88c5b5e in recv_files (f_in=5, f_out=-2125728480, local_name=0x7ffd814c0fc0 "some_dbase") at receiver.c:548 #8 0x00007f87d88d0787 in do_recv (f_in=4, f_out=6, local_name=0x0) at main.c:906 #9 0x00007f87d88d1218 in client_run (f_in=4, f_out=4, pid=-1, argc=<optimized out>, argv=<optimized out>) at main.c:1213 #10 0x00007f87d88b4330 in start_client (argv=<optimized out>, argc=<optimized out>) at main.c:1382 #11 main (argc=2, argv=0x7f87d9000240) at main.c:1651 Host B (sender): #0 0x00007f9a7130a873 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007f9a71e447a2 in perform_io (needed=needed@entry=4, flags=flags@entry=17) at io.c:742 #2 0x00007f9a71e45e2e in raw_read_buf (buf=0x7ffe7c1aa380 "", len=4) at io.c:895 #3 0x00007f9a71e47643 in read_buf (f=<optimized out>, buf=buf@entry=0x7ffe7c1aa380 "", len=len@entry=4) at io.c:1840 #4 0x00007f9a71e47851 in read_int (f=<optimized out>) at io.c:1711 #5 0x00007f9a71e2f652 in receive_sums (f=<optimized out>) at sender.c:96 #6 send_files (f_in=4, f_out=1930300096) at sender.c:332 #7 0x00007f9a71e37f6c in do_server_sender (argv=<optimized out>, argc=<optimized out>, f_out=3, f_in=3) at main.c:840 #8 start_server (f_in=3, f_out=3, argc=2, argv=0x7f9a7302f240) at main.c:1107 #9 0x00007f9a71e5a34e in rsync_module (f_in=3, f_out=3, i=0, addr=0x7f9a72098518 <numeric_ids> "\377\377\377\377", host=0x7f9a72098c80 <name_buf> "host-a.example.com") at clientserver.c:1009 #10 0x00007f9a71e5b260 in start_daemon (f_in=3, f_out=3) at clientserver.c:1100 #11 0x00007f9a71e4f5d1 in start_accept_loop (port=4, fn=0x7ffe7c1aa160, fn@entry=0x7f9a71e5b120 <start_daemon>) at socket.c:618 #12 0x00007f9a71e5bb46 in daemon_main () at clientserver.c:1202 #13 0x00007f9a71e1aeb1 in main (argc=0, argv=0x0) at main.c:1626 The worrysome part is the Send-Q on Host A, which last time contained 263392 unACK'd bytes. The rest of the queues involved were empty. tcpdump reveals no retransmission attempts, only 2 window updates and a stream of TCP keepalives after the point where the transfer halts: Bytes Time (s) SRC DST Proto Details 12995 1.637946 hostb hosta RSYNC Server MOTD 86 1.637963 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37686543 Win=23993 Len=0 TSval=2255190023 TSecr=100923227 86 1.638406 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37686543 Ack=1 Win=1688 Len=0 TSval=100923227 TSecr=2255190022 1514 1.641767 hostb hosta RSYNC Server MOTD 86 1.641802 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37687971 Win=23985 Len=0 TSval=2255190024 TSecr=100923228 1381 1.641938 hostb hosta RSYNC Server MOTD 86 1.641955 hosta hostb TCP 39115→rsync(873) [ACK] Seq=1 Ack=37689266 Win=23977 Len=0 TSval=2255190024 TSecr=100923228 86 1.647888 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=6840 Len=0 TSval=100923230 TSecr=2255190024 86 121.931963 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255220097 TSecr=100923230 86 121.932062 hostb hosta TCP [TCP Window Update] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100953302 TSecr=2255190024 86 242.248005 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255250176 TSecr=100953302 86 242.248090 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=100983381 TSecr=2255190024 86 362.567974 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255280256 TSecr=100983381 86 362.568085 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101013461 TSecr=2255190024 86 482.891958 hosta hostb TCP [TCP Keep-Alive] 39115→rsync(873) [ACK] Seq=0 Ack=37689266 Win=24576 Len=0 TSval=2255310336 TSecr=101013461 86 482.892026 hostb hosta TCP [TCP Keep-Alive ACK] rsync(873)→39115 [ACK] Seq=37689266 Ack=1 Win=8467 Len=0 TSval=101043542 TSecr=2255190024 >From the backtraces it appears as though the sender is waiting for the checksums the client has sent, while the client is waiting for the next file id to come in. I can only assume the deadlock is because of the client's checksums sitting in its Send-Q. The only potentially related part I can find in the traffic captures are a couple of Zero Window responses by hostb, which however are followed by window updates with non-zero sizes. Any ideas or hints what else to look for? Regards, Apollon -- Please use reply-all for most replies to avoid omitting the mailing list. To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html