Hi all: I wrote some time back about a hang issue when using backup pc over rsync/ssh between two linux boxes. Well I am having it again. The environment is Centos 5.2 on the servers and centos 5 on the client. rpm rsync-2.6.9-1 on client. The host is backed up to two servers running:
BackupPC verison 3.1.0 BackupPC 3.2 beta0 The last successful backup of the server on either system was an incremental (level 2, level 1 resp) on 7/30/2009. Basically it looks like there is a deadlock occurring somewhere between the server and client backuppc processes. I am using the following mechanism to try to figure out what's happening. In pc/up01.bos1.renesys.com.pl I added: $Conf{XferLogLevel} = '5'; $Conf{PingPath} = '/bin/true'; $Conf{RsyncClientCmd} = '$sshPath -q -x -l backup -o ServerAliveInterval=30 $host sudo /usr/bin/strace -o /tmp/strace.$$ -f -tt -T $rsyncPath $argList+'; (RsyncClientPath is all one long line, wrapped for email). Note that PingPath is not /bin/ping because setuid binaries (like ping) don't work under strace -f. The result of this setup is to run /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --one-file-system --bwlimit=62 --checksum-seed=32761 --ignore-times . /etc/ on the remote host up01. Then I invoke: /usr/bin/strace -u backup -f -tt -T -o /var/tmp/strace.BackupPC_dump /tools/BackupPC/bin/BackupPC_dump -f -v up01.bos1.renesys.com on my backuppc host. The verbose output from BackupPC_dump proceeds normal until: /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fyum/fyum-updatesd.conf cache = 1, invalid = , phase = 0 Sending csums for yum/yum-updatesd.conf (size=490) /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzlogin cache = 1, invalid = , phase = 0 Sending csums for zlogin (size=252) /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzlogout cache = 1, invalid = , phase = 0 Sending csums for zlogout (size=86) /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzprofile cache = 1, invalid = , phase = 0 Sending csums for zprofile (size=168) /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzshenv cache = 1, invalid = , phase = 0 Sending csums for zshenv (size=304) /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzshrc cache = 1, invalid = , phase = 0 Sending csums for zshrc (size=640) Sending csums, cnt = 1, phase = 1 /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow- verify; cached = 1, invalid = , phase = 1 gshadow-: verified cached digest /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow- cache = 1, invalid = , phase = 1 Sending csums for gshadow- (size=435) and that's it. Strace output of the client rsync process (on up01) shows: 31338 19:42:42.014776 open("/etc/zlogout", O_RDONLY) = 3 <0.000008> 31338 19:42:42.014814 fstat(3, {st_mode=S_IFREG|0644, st_size=86, ...}) = 0 <0.000006> 31338 19:42:42.014920 read(3, "#\n#\n# /etc/zlogout and ~/.zlogou"..., 86) = 86 <0.000009> 31338 19:42:42.014963 close(3) = 0 <0.000006> 31338 19:42:42.014998 open("/etc/zprofile", O_RDONLY) = 3 <0.000009> 31338 19:42:42.015035 fstat(3, {st_mode=S_IFREG|0644, st_size=168, ...}) = 0 <0.000005> 31338 19:42:42.015139 read(3, "#\n# /etc/zprofile and ~/.zprofil"..., 168) = 168 <0.000009> 31338 19:42:42.015183 close(3) = 0 <0.000007> 31338 19:42:42.015217 open("/etc/zshenv", O_RDONLY) = 3 <0.000008> 31338 19:42:42.015255 fstat(3, {st_mode=S_IFREG|0644, st_size=304, ...}) = 0 <0.000006> 31338 19:42:42.015358 read(3, "#\n# /etc/zshenv is sourced on al"..., 304) = 304 <0.000008> 31338 19:42:42.015402 close(3) = 0 <0.000006> 31338 19:42:42.015437 open("/etc/zshrc", O_RDONLY) = 3 <0.000008> 31338 19:42:42.015475 fstat(3, {st_mode=S_IFREG|0644, st_size=640, ...}) = 0 <0.000006> 31338 19:42:42.015579 read(3, "#\n# /etc/zshrc is sourced in int"..., 640) = 640 <0.000010> 31338 19:42:42.015642 close(3) = 0 <0.000006> 31338 19:42:42.015678 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) <0.000008> 31338 19:42:42.015728 write(1, "l\5\0\7", 4) = 4 <0.000078> 31338 19:42:42.015851 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) <0.000006> 31338 19:42:42.015899 write(1, "\231\201b\220\203\350_\346\207\6\0\0\1\0\0\0\0\1 0\0\0\2\0\0\0F\1\0\0\377\377\377\377"..., 1388) = 1388 <0.000076> 31338 19:42:42.016018 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) <0.000007> 31338 19:42:42.016071 write(1, "\4\0\0\7\377\377\377\377", 8) = 8 <0.000025> 31338 19:42:42.016129 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) <59.988367> 31338 19:43:42.004567 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) <59.988089> 31338 19:44:41.992744 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) <59.987474> Note that it transfers the /etc/zshrc file fine. I have seen this before and it is basically hung until the SIGALRM timeout (at 20 hours) kicks in. Nothing ever comes to it on stdin (fd 0). The lsof output of the rsync process is: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME rsync 31338 root cwd DIR 253,2 4096 357281 /home/backup rsync 31338 root rtd DIR 253,0 4096 2 / rsync 31338 root txt REG 253,1 300432 1624023 /usr/bin/rsync rsync 31338 root mem REG 253,0 130304 225404 /lib64/ld-2.5.so rsync 31338 root mem REG 253,0 1687464 225405 /lib64/libc-2.5.so rsync 31338 root mem REG 253,1 34240 2047188 /usr/lib64/libpopt.so.0.0.0 rsync 31338 root 0u unix 0xffff8101178a29c0 1345341767 socket rsync 31338 root 1u unix 0xffff8101178a29c0 1345341767 socket rsync 31338 root 2u unix 0xffff81007ce3e8c0 1345341769 socket Note that fd 3 is usually the file being transferred by the rsync. That's how I can tell how far it has progressed. Looking on the backuppc server side, the process tree looks like: strace(8005)---BackupPC_dump(8007)-+-BackupPC_dump(8024) `-ssh(8023) the strace output from the child BackupPC_dump 8024 (which I assume is talking to the rsync on up01) is: 8024 19:42:42.015114 read(11, "l\5\0\7", 65536) = 4 <0.000012> 8024 19:42:42.015229 alarm(72000) = 72000 <0.000010> 8024 19:42:42.015294 select(16, [11], NULL, [11], NULL <unfinished ...> 8024 19:42:42.015541 <... select resumed> ) = 1 (in [11]) <0.000222> 8024 19:42:42.015593 read(11, <unfinished ...> 8024 19:42:42.015666 <... read resumed> "\231\201b\220\203\350_\346\207\6\0\0\1\0\0\0\0\10\0\0\2\0\0\0F\1\0\0\377\377\377\377"..., 65536) = 1388 <0.000039> 8024 19:42:42.015769 alarm(72000 <unfinished ...> 8024 19:42:42.015831 <... alarm resumed> ) = 72000 <0.000041> 8024 19:42:42.015890 select(16, [11], NULL, [11], NULL <unfinished ...> 8024 19:42:42.015956 <... select resumed> ) = 1 (in [11]) <0.000040> 8024 19:42:42.016009 read(11, <unfinished ...> 8024 19:42:42.016068 <... read resumed> "\4\0\0\7\377\377\377\377", 65536) = 8 <0.000040> 8024 19:42:42.016112 alarm(72000) = 72000 <0.000009> 8024 19:42:42.016168 alarm(72000) = 72000 <0.000008> 8024 19:42:42.016218 select(16, [11], NULL, [11], NULL <unfinished ...> 8024 19:54:35.090490 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) <713.074237> 8024 19:54:35.090563 --- SIGWINCH (Window changed) @ 0 (0) --- 8024 19:54:35.090607 select(16, [11], NULL, [11], NULL <unfinished ...> 8024 19:54:35.121181 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) <0.030543> with the select on fd 11 just continuing. (The sigwinch is due to my changing the window size ox my xterm to display the ps tree output properly.) The strace of the parent BackupPC_dump is: 8007 19:42:35.297051 open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzshrc", O_RDONLY|O_LARGEFILE) = 8 <0.000015> 8007 19:42:35.297142 ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device) <0.000006> 8007 19:42:35.297184 _llseek(8, 0, [0], SEEK_CUR) = 0 <0.000006> 8007 19:42:35.297226 fstat64(8, {st_mode=S_IFREG|0640, st_size=468, ...}) = 0 <0.000006> 8007 19:42:35.297300 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0 <0.000006> 8007 19:42:35.297341 read(8, "\327^]RMo\0231\20=\307\277\342\21\357*\200\232D\345ZE\342\10R\21U\213\340P\365\340"..., 4096) = 468 <0.000008> 8007 19:42:35.297444 write(2, "/data/backup-pc/pc/up01.bos1.ren"..., 93) = 93 <0.000009> 8007 19:42:35.297576 write(2, "Sending csums for zshrc (size=64"..., 35) = 35 <0.000009> 8007 19:42:35.297722 close(8) = 0 <0.000007> 8007 19:42:35.297781 select(8, [7], NULL, [7], {0, 0}) = 0 (Timeout) <0.000007> 8007 19:42:35.297848 alarm(72000) = 71999 <0.000011> 8007 19:42:35.297911 select(16, [7], [10], [7 10], NULL) = 1 (out [10]) <0.000008> 8007 19:42:35.297981 write(10, "\253Fx<\34136F]s\326tH8=y\350\2\3772\214\305\10\266\307W:\253\243\317NC"..., 17840 <unfinished ...> 8007 19:42:35.298792 <... write resumed> ) = 17840 <0.000772> 8007 19:42:35.298897 write(2, "Sending csums, cnt = 1, phase = "..., 34) = 34 <0.000010> 8007 19:42:35.299150 open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-", O_RDONLY|O_LARGEFILE) = 8 <0.000018> 8007 19:42:35.299243 ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device) <0.000006> 8007 19:42:35.299287 _llseek(8, 0, [0], SEEK_CUR) = 0 <0.000006> 8007 19:42:35.299330 fstat64(8, {st_mode=S_IFREG|0640, st_size=281, ...}) = 0 <0.000006> 8007 19:42:35.299406 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0 <0.000006> 8007 19:42:35.299448 read(8, "\327^m\220A\222\304 \10E\367\234%\'\310m\210\222N\252E-\321\351\316\355G \251\232\305,"..., 4096) = 281 <0.000008> 8007 19:42:35.299552 write(2, "/data/backup-pc/pc/up01.bos1.ren"..., 105) = 105 <0.000009> 8007 19:42:35.299702 stat64("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-", {st_mode=S_IFREG|0640, st_size=281, ...}) = 0 <0.000013> 8007 19:42:35.299847 open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-", O_RDONLY|O_LARGEFILE) = 9 <0.000013> 8007 19:42:35.299920 ioctl(9, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device) <0.000006> 8007 19:42:35.299961 _llseek(9, 0, [0], SEEK_CUR) = 0 <0.000005> 8007 19:42:35.300003 fstat64(9, {st_mode=S_IFREG|0640, st_size=281, ...}) = 0 <0.000006> 8007 19:42:35.300077 fcntl64(9, F_SETFD, FD_CLOEXEC) = 0 <0.000006> 8007 19:42:35.300196 read(9, "\327^m\220A\222\304 \10E\367\234%\'\310m\210\222N\252E-\321\351\316\355G \251\232\305,"..., 131072) = 281 <0.000008> 8007 19:42:35.300365 _llseek(9, -69, [212], SEEK_CUR) = 0 <0.000006> 8007 19:42:35.300451 _llseek(9, 0, [212], SEEK_CUR) = 0 <0.000006> 8007 19:42:35.300509 close(9) = 0 <0.000007> 8007 19:42:35.300569 open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-", O_RDWR|O_LARGEFILE) = 9 <0.000015> 8007 19:42:35.300668 ioctl(9, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf86b4d8) = -1 ENOTTY (Inappropriate ioctl for device) <0.000007> 8007 19:42:35.300711 _llseek(9, 0, [0], SEEK_CUR) = 0 <0.000006> 8007 19:42:35.300753 fstat64(9, {st_mode=S_IFREG|0640, st_size=281, ...}) = 0 <0.000007> 8007 19:42:35.300828 fcntl64(9, F_SETFD, FD_CLOEXEC) = 0 <0.000006> 8007 19:42:35.300873 read(9, "\327", 1) = 1 <0.000007> 8007 19:42:35.300920 _llseek(9, 212, [212], SEEK_SET) = 0 <0.000006> 8007 19:42:35.300980 read(9, "\263\210\2208\352\341s\252\31...@\25\272\315\207\233\214\r\213\0309\1o)\32\255\303#\327\17\210"..., 70) = 69 <0.000007> 8007 19:42:35.301051 close(9) = 0 <0.000007> 8007 19:42:35.301116 write(2, "gshadow-: verified cached digest"..., 33) = 33 <0.000009> 8007 19:42:35.301243 close(8) = 0 <0.000006> 8007 19:42:35.301320 open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-", O_RDONLY|O_LARGEFILE) = 8 <0.000015> 8007 19:42:35.301396 ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device) <0.000006> 8007 19:42:35.301437 _llseek(8, 0, [0], SEEK_CUR) = 0 <0.000005> 8007 19:42:35.301480 fstat64(8, {st_mode=S_IFREG|0640, st_size=281, ...}) = 0 <0.000006> 8007 19:42:35.301555 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0 <0.000006> 8007 19:42:35.301612 read(8, "\327^m\220A\222\304 \10E\367\234%\'\310m\210\222N\252E-\321\351\316\355G \251\232\305,"..., 4096) = 281 <0.000009> 8007 19:42:35.301725 write(2, "/data/backup-pc/pc/up01.bos1.ren"..., 96) = 96 <0.000010> 8007 19:42:35.301868 write(2, "Sending csums for gshadow- (size"..., 38) = 38 <0.000010> 8007 19:42:35.302024 close(8) = 0 <0.000008> 8007 19:42:35.302103 select(8, [7], NULL, [7], NULL <unfinished ...> 8007 19:54:35.090584 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) <719.788455> 8007 19:54:35.090662 --- SIGWINCH (Window changed) @ 0 (0) --- 8007 19:54:35.090711 select(8, [7], NULL, [7], NULL <unfinished ...> 8007 19:54:35.121243 <... select resumed> ) = ? ERESTARTNOHAND (To be restarted) <0.030506> with the select on fd 7 continuing. The lsof of the parent BackupPC process (8007) is: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME [...] BackupPC_ 8007 backup mem REG 253,1 8808 1400404 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/Cwd/Cwd.so BackupPC_ 8007 backup mem REG 253,0 3103092 1063539 /lib/libnss_ldap-2.5.so BackupPC_ 8007 backup DEL REG 253,4 1981333 /var/run/nscd/dbI54MYz BackupPC_ 8007 backup DEL REG 253,4 1981330 /var/run/nscd/dbPr0Bwh BackupPC_ 8007 backup mem REG 253,1 56413488 1401690 /usr/lib/locale/locale-archive BackupPC_ 8007 backup 0u CHR 136,3 5 /dev/pts/3 BackupPC_ 8007 backup 1w FIFO 0,6 2095041895 pipe BackupPC_ 8007 backup 2w FIFO 0,6 2095041895 pipe BackupPC_ 8007 backup 3u IPv4 2095041985 TCP ops01.fp.bos1.renesys.com:46753->bos1.auth.staff.renesys.com:ldaps (ESTABLISHED) BackupPC_ 8007 backup 4w REG 9,2 8597 39944901 /backup-pc/pc/up01.bos1.renesys.com/LOG.082009 BackupPC_ 8007 backup 5w REG 9,2 0 39946542 /backup-pc/pc/up01.bos1.renesys.com/XferLOG.z BackupPC_ 8007 backup 6w REG 9,2 0 39946835 /backup-pc/pc/up01.bos1.renesys.com/NewFileList BackupPC_ 8007 backup 7u unix 0xf7685580 2095042158 socket BackupPC_ 8007 backup 10u unix 0xf7685c80 2095042115 socket The lsof of pid 8024: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME [...] BackupPC_ 8024 backup mem REG 253,1 8808 1400404 /usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/Cwd/Cwd.so BackupPC_ 8024 backup mem REG 253,0 3103092 1063539 /lib/libnss_ldap-2.5.so BackupPC_ 8024 backup DEL REG 253,4 1981333 /var/run/nscd/dbI54MYz BackupPC_ 8024 backup DEL REG 253,4 1981330 /var/run/nscd/dbPr0Bwh BackupPC_ 8024 backup mem REG 253,1 56413488 1401690 /usr/lib/locale/locale-archive BackupPC_ 8024 backup 0u CHR 136,3 5 /dev/pts/3 BackupPC_ 8024 backup 1w FIFO 0,6 2095041895 pipe BackupPC_ 8024 backup 2w FIFO 0,6 2095041895 pipe BackupPC_ 8024 backup 3u IPv4 2095041985 TCP ops01.fp.bos1.renesys.com:46753->bos1.auth.staff.renesys.com:ldaps (ESTABLISHED) BackupPC_ 8024 backup 4w REG 9,2 8597 39944901 /backup-pc/pc/up01.bos1.renesys.com/LOG.082009 BackupPC_ 8024 backup 5w REG 9,2 0 39946542 /backup-pc/pc/up01.bos1.renesys.com/XferLOG.z BackupPC_ 8024 backup 6w REG 9,2 0 39946835 /backup-pc/pc/up01.bos1.renesys.com/NewFileList BackupPC_ 8024 backup 9u unix 0xcfb97040 2095042159 socket BackupPC_ 8024 backup 11u unix 0xf7685c80 2095042115 socket The lsof of the ssh (pid 8023) shows: COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME [...] ssh 8023 backup mem REG 253,0 7748 1061368 /lib/libcom_err.so.2.1 ssh 8023 backup mem REG 253,1 157304 1405482 /usr/lib/libk5crypto.so.3.1 ssh 8023 backup DEL REG 253,4 1981333 /var/run/nscd/dbI54MYz ssh 8023 backup DEL REG 253,4 1981330 /var/run/nscd/dbPr0Bwh ssh 8023 backup 0u unix 0xf7685740 2095042114 socket ssh 8023 backup 1u unix 0xf7685740 2095042114 socket ssh 8023 backup 2u unix 0xf7685740 2095042114 socket ssh 8023 backup 3u IPv4 2095042155 TCP ops01.fp.bos1.renesys.com:48563->up01.fp.bos1.renesys.com:ssh (ESTABLISHED) ssh 8023 backup 4u unix 0xf7685740 2095042114 socket ssh 8023 backup 5u unix 0xf7685740 2095042114 socket ssh 8023 backup 6u unix 0xf7685740 2095042114 socket The strace of the sshd does show data transfer occurring, but I think it it ssh procol keepalive etc messages and not actual data. I have managed to "solve" this in the past by: * Using a full rather than incremental backup. That didn't work this time. * Blowing away prior completed backups to get back to a state where the new backups no longer hung. Haven't tried that yet. So does anybody have any ideas what could be wrong, what my next steps to get debugging info should be? -- -- rouilj John Rouillard System Administrator Renesys Corporation 603-244-9084 (cell) 603-643-9300 x 111 ------------------------------------------------------------------------------ Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july _______________________________________________ BackupPC-devel mailing list BackupPC-devel@lists.sourceforge.net List: https://lists.sourceforge.net/lists/listinfo/backuppc-devel Wiki: http://backuppc.wiki.sourceforge.net Project: http://backuppc.sourceforge.net/