Hi all:

I wrote some time back about a hang issue when using backup pc over
rsync/ssh between two linux boxes. Well I am having it again. The
environment is Centos 5.2 on the servers and centos 5 on the
client. rpm rsync-2.6.9-1 on client. The host is backed up to two
servers running:

  BackupPC verison 3.1.0
  BackupPC 3.2 beta0

The last successful backup of the server on either system was an
incremental (level 2, level 1 resp) on 7/30/2009.

Basically it looks like there is a deadlock occurring somewhere
between the server and client backuppc processes.

I am using the following mechanism to try to figure out what's
happening. In pc/up01.bos1.renesys.com.pl I added:
  
  $Conf{XferLogLevel} = '5';

  $Conf{PingPath} = '/bin/true';

  $Conf{RsyncClientCmd} = '$sshPath -q -x -l backup -o
     ServerAliveInterval=30 $host sudo /usr/bin/strace -o /tmp/strace.$$ -f
     -tt -T $rsyncPath $argList+';

(RsyncClientPath is all one long line, wrapped for email). Note that
PingPath is not /bin/ping because setuid binaries (like ping)
don't work under strace -f. The result of this setup is to run

  /usr/bin/rsync --server --sender --numeric-ids --perms --owner
    --group -D --links --hard-links --times --block-size=2048
    --recursive --one-file-system --bwlimit=62
    --checksum-seed=32761 --ignore-times . /etc/

on the remote host up01.

Then I invoke:

  /usr/bin/strace -u backup -f -tt -T -o /var/tmp/strace.BackupPC_dump
   /tools/BackupPC/bin/BackupPC_dump -f -v up01.bos1.renesys.com

on my backuppc host.

The verbose output from BackupPC_dump proceeds normal until:

  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fyum/fyum-updatesd.conf
    cache = 1, invalid = , phase = 0
  Sending csums for yum/yum-updatesd.conf (size=490)
  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzlogin
    cache = 1, invalid = , phase = 0
  Sending csums for zlogin (size=252)
  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzlogout
    cache = 1, invalid = , phase = 0
  Sending csums for zlogout (size=86)
  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzprofile
    cache = 1, invalid = , phase = 0
  Sending csums for zprofile (size=168)
  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzshenv
    cache = 1, invalid = , phase = 0
  Sending csums for zshenv (size=304)
  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzshrc cache
    = 1, invalid = , phase = 0
  Sending csums for zshrc (size=640)
  Sending csums, cnt = 1, phase = 1
  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-
    verify; cached = 1, invalid = , phase = 1
  gshadow-: verified cached digest
  /data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-
    cache = 1, invalid = , phase = 1
  Sending csums for gshadow- (size=435)
  
and that's it. Strace output of the client rsync process (on
up01) shows:

  31338 19:42:42.014776 open("/etc/zlogout", O_RDONLY) = 3
    <0.000008>
  31338 19:42:42.014814 fstat(3, {st_mode=S_IFREG|0644, st_size=86,
    ...}) = 0 <0.000006>
  31338 19:42:42.014920 read(3, "#\n#\n# /etc/zlogout and
    ~/.zlogou"..., 86) = 86 <0.000009>
  31338 19:42:42.014963 close(3)          = 0 <0.000006>
  31338 19:42:42.014998 open("/etc/zprofile", O_RDONLY) = 3
    <0.000009>
  31338 19:42:42.015035 fstat(3, {st_mode=S_IFREG|0644,
    st_size=168, ...}) = 0 <0.000005>
  31338 19:42:42.015139 read(3, "#\n# /etc/zprofile and
    ~/.zprofil"..., 168) = 168 <0.000009>
  31338 19:42:42.015183 close(3)          = 0 <0.000007>
  31338 19:42:42.015217 open("/etc/zshenv", O_RDONLY) = 3
    <0.000008>
  31338 19:42:42.015255 fstat(3, {st_mode=S_IFREG|0644,
    st_size=304, ...}) = 0 <0.000006>
  31338 19:42:42.015358 read(3, "#\n# /etc/zshenv is sourced on
    al"..., 304) = 304 <0.000008>
  31338 19:42:42.015402 close(3)          = 0 <0.000006>
  31338 19:42:42.015437 open("/etc/zshrc", O_RDONLY) = 3 <0.000008>
  31338 19:42:42.015475 fstat(3, {st_mode=S_IFREG|0644,
    st_size=640, ...}) = 0 <0.000006>
  31338 19:42:42.015579 read(3, "#\n# /etc/zshrc is sourced in
    int"..., 640) = 640 <0.000010>
  31338 19:42:42.015642 close(3)          = 0 <0.000006>
  31338 19:42:42.015678 select(2, NULL, [1], [1], {60, 0}) = 1 (out
    [1], left {60, 0}) <0.000008>
  31338 19:42:42.015728 write(1, "l\5\0\7", 4) = 4 <0.000078>
  31338 19:42:42.015851 select(2, NULL, [1], [1], {60, 0}) = 1 (out
    [1], left {60, 0}) <0.000006>
  31338 19:42:42.015899 write(1,
    "\231\201b\220\203\350_\346\207\6\0\0\1\0\0\0\0\1
     0\0\0\2\0\0\0F\1\0\0\377\377\377\377"..., 1388) = 1388 <0.000076>
  31338 19:42:42.016018 select(2, NULL, [1], [1], {60, 0}) = 1 (out
    [1], left {60,
   0}) <0.000007>
  31338 19:42:42.016071 write(1, "\4\0\0\7\377\377\377\377", 8) = 8
    <0.000025>
  31338 19:42:42.016129 select(1, [0], [], NULL, {60, 0}) = 0
    (Timeout) <59.988367>
  31338 19:43:42.004567 select(1, [0], [], NULL, {60, 0}) = 0
    (Timeout) <59.988089>
  31338 19:44:41.992744 select(1, [0], [], NULL, {60, 0}) = 0
    (Timeout) <59.987474>

Note that it transfers the /etc/zshrc file fine. I have seen this
before and it is basically hung until the SIGALRM timeout (at 20
hours) kicks in. Nothing ever comes to it on stdin (fd 0). The
lsof output of the rsync process is:

  COMMAND   PID USER   FD   TYPE             DEVICE    SIZE  NODE NAME
  rsync   31338 root  cwd    DIR              253,2    4096  357281 /home/backup
  rsync   31338 root  rtd    DIR              253,0    4096   2 /
  rsync   31338 root  txt    REG              253,1  300432  1624023 
/usr/bin/rsync
  rsync   31338 root  mem    REG              253,0  130304  225404 
/lib64/ld-2.5.so
  rsync   31338 root  mem    REG              253,0 1687464  225405 
/lib64/libc-2.5.so
  rsync   31338 root  mem    REG              253,1   34240  2047188 
/usr/lib64/libpopt.so.0.0.0
  rsync   31338 root    0u  unix 0xffff8101178a29c0  1345341767 socket
  rsync   31338 root    1u  unix 0xffff8101178a29c0  1345341767 socket
  rsync   31338 root    2u  unix 0xffff81007ce3e8c0  1345341769 socket

Note that fd 3 is usually the file being transferred by the
rsync. That's how I can tell how far it has progressed.

Looking on the backuppc server side, the process tree looks like:

  strace(8005)---BackupPC_dump(8007)-+-BackupPC_dump(8024)
                                     `-ssh(8023)

the strace output from the child BackupPC_dump 8024 (which I
assume is talking to the rsync on up01) is:

  8024  19:42:42.015114 read(11, "l\5\0\7", 65536) = 4 <0.000012>
  8024  19:42:42.015229 alarm(72000)      = 72000 <0.000010>
  8024  19:42:42.015294 select(16, [11], NULL, [11], NULL <unfinished ...>
  8024  19:42:42.015541 <... select resumed> ) = 1 (in [11]) <0.000222>
  8024  19:42:42.015593 read(11,  <unfinished ...>
  8024  19:42:42.015666 <... read resumed>
    
"\231\201b\220\203\350_\346\207\6\0\0\1\0\0\0\0\10\0\0\2\0\0\0F\1\0\0\377\377\377\377"...,
    65536) = 1388 <0.000039>
  8024  19:42:42.015769 alarm(72000 <unfinished ...>
  8024  19:42:42.015831 <... alarm resumed> ) = 72000 <0.000041>
  8024  19:42:42.015890 select(16, [11], NULL, [11], NULL  <unfinished ...>
  8024  19:42:42.015956 <... select resumed> ) = 1 (in [11]) <0.000040>
  8024  19:42:42.016009 read(11,  <unfinished ...>
  8024  19:42:42.016068 <... read resumed>
       "\4\0\0\7\377\377\377\377", 65536) = 8 <0.000040>
  8024  19:42:42.016112 alarm(72000)      = 72000 <0.000009>
  8024  19:42:42.016168 alarm(72000)      = 72000 <0.000008>
  8024  19:42:42.016218 select(16, [11], NULL, [11], NULL <unfinished ...>
  8024  19:54:35.090490 <... select resumed> ) = ? ERESTARTNOHAND
     (To be restarted) <713.074237>
  8024  19:54:35.090563 --- SIGWINCH (Window changed) @ 0 (0) ---
  8024  19:54:35.090607 select(16, [11], NULL, [11], NULL <unfinished ...>
  8024  19:54:35.121181 <... select resumed> ) = ? ERESTARTNOHAND
     (To be restarted) <0.030543>

with the select on fd 11 just continuing. (The sigwinch is due to
my changing the window size ox my xterm to display the ps tree
output properly.)

The strace of the parent BackupPC_dump is:
    
  8007 19:42:35.297051
    open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fzshrc",
    O_RDONLY|O_LARGEFILE) = 8 <0.000015>
  8007 19:42:35.297142 ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS,
    0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device)
    <0.000006>
  8007 19:42:35.297184 _llseek(8, 0, [0], SEEK_CUR) = 0
    <0.000006>
  8007 19:42:35.297226 fstat64(8, {st_mode=S_IFREG|0640,
    st_size=468, ...}) = 0 <0.000006>
  8007 19:42:35.297300 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
    <0.000006>
  8007 19:42:35.297341 read(8,
    
"\327^]RMo\0231\20=\307\277\342\21\357*\200\232D\345ZE\342\10R\21U\213\340P\365\340"...,
    4096) = 468 <0.000008>
  8007 19:42:35.297444 write(2,
    "/data/backup-pc/pc/up01.bos1.ren"..., 93) = 93 <0.000009>
  8007 19:42:35.297576 write(2, "Sending csums for zshrc
    (size=64"..., 35) = 35 <0.000009>
  8007 19:42:35.297722 close(8) = 0 <0.000007>
  8007 19:42:35.297781 select(8, [7], NULL, [7], {0, 0}) = 0
    (Timeout) <0.000007>
  8007 19:42:35.297848 alarm(72000) = 71999 <0.000011>
  8007 19:42:35.297911 select(16, [7], [10], [7 10], NULL) = 1
    (out [10]) <0.000008>
  8007 19:42:35.297981 write(10,
    
"\253Fx<\34136F]s\326tH8=y\350\2\3772\214\305\10\266\307W:\253\243\317NC"...,
    17840 <unfinished ...>
  8007 19:42:35.298792 <... write resumed> ) = 17840 <0.000772>
  8007 19:42:35.298897 write(2, "Sending csums, cnt = 1, phase =
    "..., 34) = 34 <0.000010>
  8007 19:42:35.299150
    open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-",
    O_RDONLY|O_LARGEFILE) = 8 <0.000018>
  8007 19:42:35.299243 ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS,
    0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device)
    <0.000006>
  8007 19:42:35.299287 _llseek(8, 0, [0], SEEK_CUR) = 0
    <0.000006>
  8007 19:42:35.299330 fstat64(8, {st_mode=S_IFREG|0640,
    st_size=281, ...}) = 0 <0.000006>
  8007 19:42:35.299406 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
    <0.000006>
  8007 19:42:35.299448 read(8, "\327^m\220A\222\304
    \10E\367\234%\'\310m\210\222N\252E-\321\351\316\355G
    \251\232\305,"..., 4096) = 281 <0.000008>                      
  8007 19:42:35.299552 write(2,
    "/data/backup-pc/pc/up01.bos1.ren"..., 105) = 105 <0.000009>
  8007 19:42:35.299702
    stat64("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-",
    {st_mode=S_IFREG|0640, st_size=281, ...}) = 0 <0.000013>
  8007 19:42:35.299847
    open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-",
    O_RDONLY|O_LARGEFILE) = 9 <0.000013>
  8007 19:42:35.299920 ioctl(9, SNDCTL_TMR_TIMEBASE or TCGETS,
    0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device)
    <0.000006>
  8007 19:42:35.299961 _llseek(9, 0, [0], SEEK_CUR) = 0
    <0.000005>
  8007 19:42:35.300003 fstat64(9, {st_mode=S_IFREG|0640,
    st_size=281, ...}) = 0 <0.000006>
  8007 19:42:35.300077 fcntl64(9, F_SETFD, FD_CLOEXEC) = 0
    <0.000006>
  8007 19:42:35.300196 read(9, "\327^m\220A\222\304
    \10E\367\234%\'\310m\210\222N\252E-\321\351\316\355G
    \251\232\305,"..., 131072) = 281 <0.000008>                    
  8007 19:42:35.300365 _llseek(9, -69, [212], SEEK_CUR) = 0
    <0.000006>
  8007 19:42:35.300451 _llseek(9, 0, [212], SEEK_CUR) = 0
    <0.000006>
  8007 19:42:35.300509 close(9) = 0 <0.000007>
  8007 19:42:35.300569
    open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-",
    O_RDWR|O_LARGEFILE) = 9 <0.000015>
  8007 19:42:35.300668 ioctl(9, SNDCTL_TMR_TIMEBASE or TCGETS,
    0xbf86b4d8) = -1 ENOTTY (Inappropriate ioctl for device)
    <0.000007>
  8007 19:42:35.300711 _llseek(9, 0, [0], SEEK_CUR) = 0
    <0.000006>
  8007 19:42:35.300753 fstat64(9, {st_mode=S_IFREG|0640,
    st_size=281, ...}) = 0 <0.000007>
  8007 19:42:35.300828 fcntl64(9, F_SETFD, FD_CLOEXEC) = 0
    <0.000006>
  8007 19:42:35.300873 read(9, "\327", 1) = 1 <0.000007>
  8007 19:42:35.300920 _llseek(9, 212, [212], SEEK_SET) = 0
    <0.000006>
  8007 19:42:35.300980 read(9,
    
"\263\210\2208\352\341s\252\31...@\25\272\315\207\233\214\r\213\0309\1o)\32\255\303#\327\17\210"...,
    70) = 69 <0.000007>
  8007 19:42:35.301051 close(9) = 0 <0.000007>
  8007 19:42:35.301116 write(2, "gshadow-: verified cached
    digest"..., 33) = 33 <0.000009>
  8007 19:42:35.301243 close(8) = 0 <0.000006>
  8007 19:42:35.301320
    open("/data/backup-pc/pc/up01.bos1.renesys.com/571/f%2fetc/fgshadow-",
    O_RDONLY|O_LARGEFILE) = 8 <0.000015>
  8007 19:42:35.301396 ioctl(8, SNDCTL_TMR_TIMEBASE or TCGETS,
    0xbf86b4f8) = -1 ENOTTY (Inappropriate ioctl for device)
    <0.000006>
  8007 19:42:35.301437 _llseek(8, 0, [0], SEEK_CUR) = 0
    <0.000005>
  8007 19:42:35.301480 fstat64(8, {st_mode=S_IFREG|0640,
    st_size=281, ...}) = 0 <0.000006>
  8007 19:42:35.301555 fcntl64(8, F_SETFD, FD_CLOEXEC) = 0
    <0.000006>
  8007 19:42:35.301612 read(8, "\327^m\220A\222\304
    \10E\367\234%\'\310m\210\222N\252E-\321\351\316\355G
    \251\232\305,"..., 4096) = 281 <0.000009>
  8007 19:42:35.301725 write(2,
    "/data/backup-pc/pc/up01.bos1.ren"..., 96) = 96 <0.000010>
  8007 19:42:35.301868 write(2, "Sending csums for gshadow-
    (size"..., 38) = 38 <0.000010>
  8007 19:42:35.302024 close(8) = 0 <0.000008>
  8007 19:42:35.302103 select(8, [7], NULL, [7], NULL <unfinished
    ...>
  8007 19:54:35.090584 <... select resumed> ) = ? ERESTARTNOHAND
    (To be restarted) <719.788455>
  8007 19:54:35.090662 --- SIGWINCH (Window changed) @ 0 (0) ---
  8007 19:54:35.090711 select(8, [7], NULL, [7], NULL <unfinished
    ...>
  8007 19:54:35.121243 <... select resumed> ) = ? ERESTARTNOHAND
    (To be restarted) <0.030506>

with the select on fd 7 continuing. The lsof of the parent
BackupPC process (8007) is:

  COMMAND    PID   USER   FD   TYPE     DEVICE     SIZE       NODE  NAME
  [...]
  BackupPC_ 8007 backup  mem    REG      253,1     8808    1400404  
/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/Cwd/Cwd.so
  BackupPC_ 8007 backup  mem    REG      253,0  3103092    1063539  
/lib/libnss_ldap-2.5.so
  BackupPC_ 8007 backup  DEL    REG      253,4             1981333  
/var/run/nscd/dbI54MYz
  BackupPC_ 8007 backup  DEL    REG      253,4             1981330  
/var/run/nscd/dbPr0Bwh
  BackupPC_ 8007 backup  mem    REG      253,1 56413488    1401690  
/usr/lib/locale/locale-archive
  BackupPC_ 8007 backup    0u   CHR      136,3                   5  /dev/pts/3
  BackupPC_ 8007 backup    1w  FIFO        0,6          2095041895  pipe
  BackupPC_ 8007 backup    2w  FIFO        0,6          2095041895  pipe
  BackupPC_ 8007 backup    3u  IPv4 2095041985                 TCP  
ops01.fp.bos1.renesys.com:46753->bos1.auth.staff.renesys.com:ldaps
  (ESTABLISHED)
  BackupPC_ 8007 backup    4w   REG        9,2     8597   39944901  
/backup-pc/pc/up01.bos1.renesys.com/LOG.082009
  BackupPC_ 8007 backup    5w   REG        9,2        0   39946542  
/backup-pc/pc/up01.bos1.renesys.com/XferLOG.z
  BackupPC_ 8007 backup    6w   REG        9,2        0   39946835  
/backup-pc/pc/up01.bos1.renesys.com/NewFileList
  BackupPC_ 8007 backup    7u  unix 0xf7685580          2095042158  socket
  BackupPC_ 8007 backup   10u  unix 0xf7685c80          2095042115  socket

The lsof of pid 8024:

  COMMAND    PID   USER   FD   TYPE     DEVICE     SIZE       NODE NAME
  [...]
  BackupPC_ 8024 backup  mem    REG      253,1     8808    1400404 
/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/Cwd/Cwd.so
  BackupPC_ 8024 backup  mem    REG      253,0  3103092    1063539 
/lib/libnss_ldap-2.5.so
  BackupPC_ 8024 backup  DEL    REG      253,4             1981333 
/var/run/nscd/dbI54MYz
  BackupPC_ 8024 backup  DEL    REG      253,4             1981330 
/var/run/nscd/dbPr0Bwh
  BackupPC_ 8024 backup  mem    REG      253,1 56413488    1401690 
/usr/lib/locale/locale-archive
  BackupPC_ 8024 backup    0u   CHR      136,3                   5 /dev/pts/3
  BackupPC_ 8024 backup    1w  FIFO        0,6          2095041895 pipe
  BackupPC_ 8024 backup    2w  FIFO        0,6          2095041895 pipe
  BackupPC_ 8024 backup    3u  IPv4 2095041985                 TCP 
ops01.fp.bos1.renesys.com:46753->bos1.auth.staff.renesys.com:ldaps
  (ESTABLISHED)
  BackupPC_ 8024 backup    4w   REG        9,2     8597   39944901 
/backup-pc/pc/up01.bos1.renesys.com/LOG.082009
  BackupPC_ 8024 backup    5w   REG        9,2        0   39946542 
/backup-pc/pc/up01.bos1.renesys.com/XferLOG.z
  BackupPC_ 8024 backup    6w   REG        9,2        0   39946835 
/backup-pc/pc/up01.bos1.renesys.com/NewFileList
  BackupPC_ 8024 backup    9u  unix 0xcfb97040          2095042159 socket
  BackupPC_ 8024 backup   11u  unix 0xf7685c80          2095042115 socket

The lsof of the ssh (pid 8023) shows:
  COMMAND    PID   USER   FD   TYPE     DEVICE     SIZE     NODE NAME
  [...]
  ssh     8023 backup  mem    REG      253,0    7748    1061368  
/lib/libcom_err.so.2.1
  ssh     8023 backup  mem    REG      253,1  157304    1405482  
/usr/lib/libk5crypto.so.3.1
  ssh     8023 backup  DEL    REG      253,4            1981333  
/var/run/nscd/dbI54MYz
  ssh     8023 backup  DEL    REG      253,4            1981330  
/var/run/nscd/dbPr0Bwh
  ssh     8023 backup    0u  unix 0xf7685740         2095042114  socket
  ssh     8023 backup    1u  unix 0xf7685740         2095042114  socket
  ssh     8023 backup    2u  unix 0xf7685740         2095042114  socket
  ssh     8023 backup    3u  IPv4 2095042155                TCP  
ops01.fp.bos1.renesys.com:48563->up01.fp.bos1.renesys.com:ssh
  (ESTABLISHED)
  ssh     8023 backup    4u  unix 0xf7685740         2095042114  socket
  ssh     8023 backup    5u  unix 0xf7685740         2095042114  socket
  ssh     8023 backup    6u  unix 0xf7685740         2095042114  socket

The strace of the sshd does show data transfer occurring, but I
think it it ssh procol keepalive etc messages and not actual
data.

I have managed to "solve" this in the past by:

   * Using a full rather than incremental backup. That didn't
     work this time.
   * Blowing away prior completed backups to get back to a state 
     where the new backups no longer hung. Haven't tried that yet.

So does anybody have any ideas what could be wrong, what my next
steps to get debugging info should be?

-- 
                                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111

------------------------------------------------------------------------------
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day 
trial. Simplify your report design, integration and deployment - and focus on 
what you do best, core application coding. Discover what's new with 
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
_______________________________________________
BackupPC-devel mailing list
BackupPC-devel@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-devel
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/

Reply via email to