Hi guys:

First post for me on this list. 

THE PROBLEM
===========
I have several linux servers (ranging from Redhat 7.3 to FC3) that use Rsync 
(version 2.6.8 and 2.6.7, read on for reasons) to backup their data. A typical 
rsync backup script looks like this:

------------------------------------------------------------------------------------------
#!/bin/sh
RSYNC_PASSWORD=<PASSWORD>
rsync -a --delete --password-file=<PASSWORD FILE> --ignore-errors --exclude 
'/dev/*' --exclude '/sys/*' --exclude '/proc/*' --exclude '/var/run/*' 
--exclude '/tmp/*' --exclude '/home/downloads/*' / 
<USERNAME>@<ADDRESS>::<MODULE> >> /var/log/rsync.log
rsync -a --progress --delete --ignore-errors --password-file=<PASSWORD FILE> 
/home/downloads/ <USERNAME>@<ADDRESS>::<MODULE> >> /var/log/rsync.log
------------------------------------------------------------------------------------------
(Some things have been replaced to prevent people doing mean things to our 
servers :P)

For a long time rsync has worked, occasionly barfing due to lack of harddrive 
space or for other external reasons. When we upgraded to version 2.6.8 however, 
things started to die.

Here is a typical error we get:
------------------------------------------------------------------------------------------
unexpected tag 3 [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(828) 
[sender=2.6.8]
------------------------------------------------------------------------------------------

SERVERS
=======
For the purpose of this email, I will talk about two servers in particular. One 
is an apache server, which I will refer to as APACHE, the other our backup 
server, which we will call RSERVER. 

APACHE is a Fedora Core release 3 (Heidelberg) box, running Linux 
2.6.12-1.1381_FC3 with Rsync 2.6.8, installed from an RPM. All software is up 
to date to my knowledge. 

RSERVER is also a Fedora Core release 3 (Heidelberg) box, also running Linux 
2.6.12-1.1381_FC3 with Rsync 2.6.8, again installed from RPM. Again, all 
software is up to date.

SETUP
=====
The script you see above is the script used to copy the data between APACHE and 
RSERVER. It runs once a night, starting at 21:00 and goes for some hours. The 
amount of data it has to think about amounts to is some 100 gig. What it has to 
sync over the course of an evening can be siginificantly less (I see two 
reports that say 46 meg and 32 meg). 

The two machines talk to each other over a 100 megabit switched LAN. APACHE is 
a server that has port forwards to it and has no software firewall running 
(well Iptables almost always runs, but it has no rules and all chains are set 
to ACCEPT). 
The same with RSERVER.

The script is kicked off by a cronjob running as root.

DETAILS
=======
Around about 10 minutes after the script runs, it will fail with the above 
error message. With small amounts of data (eg 10 meg, 20 meg, no known 
boundary) the script will work. In fact, I see that some files do work with the 
cron script. However when I start to sync more data (1.6 gig, 1.7 gig etc), it 
fails part of the way through with the above.

Interestingly, CWRsync seems to work (ie Rsync running under Cygwin on 
Windows). However this is not exclusive. Some do, some do not. I cannot work 
out why (it may be because they are under the size threshold?)

TROUBLESHOOTING
===============
- Small Rsyncs will work. No known boundary between small and large though (ie 
I cant say "More than this amount of data it will fail")
- Tried running Rsync in Xinetd.
- Tried running Rsync as a daemon, as root
- Using blocking-io
- Using no-blocking-io
- Pings and SSH sessions to watch for network problems.
- Packet dumps (very limited, I have trouble with TCPDump... give me ethereal 
any day :P)
- Straces
- Replacing 2.6.8 with 2.6.7. 2.6.7 used to work. (that's why I have said above 
two seperate version):
        We get a marginally different error message:
        
------------------------------------------------------------------------------------------
        unexpected tag 3 [sender]
        rsync error: error in rsync protocol data stream (code 12) at io.c(843) 
[sender]
        
------------------------------------------------------------------------------------------
        The difference is that the sender on the 2nd line does not have 
(2.6.8). That's the only difference I can find. 

STRACE #1
=========
I can provide whole straces if people request, but here are some interesting 
things I have found.

This is on server RSERVER, the rsync server side, running 2.6.8 as xinetd:
------------------------------------------------------------------------------------------
21219 open("/etc/localtime", O_RDONLY)  = -1 ENOENT (No such file or directory)
21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 143) = 143
21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "s\0\0\n", 4)             = 4
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "recv 51.32.168.192.in-addr.arpa "..., 115) = 115
21219 time(NULL)                        = 1146546636
21219 open("/etc/localtime", O_RDONLY)  = -1 ENOENT (No such file or directory)
21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 143) = 143
21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "j\0\0\n", 4)             = 4
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "recv 51.32.168.192.in-addr.arpa "..., 106) = 106
21219 time(NULL)                        = 1146546636
21219 open("/etc/localtime", O_RDONLY)  = -1 ENOENT (No such file or directory)
21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 134) = 134
21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "f\0\0\n", 4)             = 4
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "recv 51.32.168.192.in-addr.arpa "..., 102) = 102
21219 time(NULL)                        = 1146546636
21219 open("/etc/localtime", O_RDONLY)  = -1 ENOENT (No such file or directory)
21219 write(3, "2006/05/02 05:10:36 [21219] recv"..., 130) = 130
21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "2\0\0\r", 4)             = 4
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "rsync: read error: Connection re"..., 50) = 50
21219 time(NULL)                        = 1146546636
21219 open("/etc/localtime", O_RDONLY)  = -1 ENOENT (No such file or directory)
21219 write(3, "2006/05/02 05:10:36 [21219] rsyn"..., 78) = 78
21219 select(5, [4], [0], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "Y\0\0\10", 4)            = 4
21219 select(5, [4], [], NULL, {60, 0}) = 1 (in [4], left {60, 0})
21219 read(4, "rsync error: error in rsync prot"..., 89) = 89
21219 time(NULL)                        = 1146546636
21219 open("/etc/localtime", O_RDONLY)  = -1 ENOENT (No such file or directory)
21219 write(3, "2006/05/02 05:10:36 [21219] rsyn"..., 117) = 117
------------------------------------------------------------------------------------------
Notice /etc/localtime not being able to open? 

------------------------------------------------------------------------------------------
[EMAIL PROTECTED] tmp]# ls -lah /etc/localtime
lrwxrwxrwx  1 root root 36 Mar 27 13:50 /etc/localtime -> 
/usr/share/zoneinfo/Australia/Sydney
[EMAIL PROTECTED] tmp]# ls -lah /usr/share/zoneinfo/Australia/Sydney
-rw-r--r--  1 root root 785 Mar 27 11:05 /usr/share/zoneinfo/Australia/Sydney
[EMAIL PROTECTED] tmp]#
------------------------------------------------------------------------------------------
Another weird thing is that it does the same for /etc/fstab! (The below is 
significantly further back in the STRACE than the above)
------------------------------------------------------------------------------------------
21219 open("/etc/mtab", O_RDONLY)       = -1 ENOENT (No such file or directory)
21219 open("/etc/fstab", O_RDONLY)      = -1 ENOENT (No such file or directory)
21219 open("/proc/meminfo", O_RDONLY)   = -1 ENOENT (No such file or directory)
------------------------------------------------------------------------------------------
Yet at other places in the same STRACE, it does open the file:
------------------------------------------------------------------------------------------
21219 open("/etc/localtime", O_RDONLY)  = 3
21219 fstat64(3, {st_mode=S_IFREG|0644, st_size=785, ...}) = 0
21219 fstat64(3, {st_mode=S_IFREG|0644, st_size=785, ...}) = 0
------------------------------------------------------------------------------------------
I assume that is exit code 0, which means it works. Either way it doesn't throw 
an error I can see.

STRACE #2
=========
This is an strace run on APACHE (the rsync client). I apologise, I cannot say 
whether it was 2.6.8 or 2.6.7. I'm pretty sure it's 2.6.8. 
Here is the interesting section:

------------------------------------------------------------------------------------------
5503  gettimeofday({1146614119, 369460}, NULL) = 0
5503  ioctl(1, TIOCGPGRP, [48])         = -1 ENOTTY (Inappropriate ioctl for 
device)
5503  write(1, "         583 100%    0.94kB/s   "..., 41) = 41
5503  select(4, NULL, [3], NULL, {60, 0}) = 1 (out [3], left {60, 0})
5503  write(3, "\333\323\361\276[\177\376\33\3\34\302\4\235.\237\301", 16) = 16
5503  gettimeofday({1146614119, 369703}, NULL) = 0
5503  write(1, "         583 100%    0.94kB/s   "..., 75) = 75
5503  close(4)                          = 0
5503  select(4, [3], [], NULL, {60, 0}) = 1 (in [3], left {60, 0})
5503  read(3, "Y\0\0\n", 4)             = 4
5503  write(2, "unexpected tag 3 [sender]", 25) = 25
5503  write(2, "\n", 1)                 = 1
5503  rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0
5503  rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0
5503  write(2, "rsync error: error in rsync prot"..., 80) = 80
5503  write(2, "\n", 1)                 = 1
5503  munmap(0xb7d67000, 4096)          = 0
5503  exit_group(12)                    = ?
------------------------------------------------------------------------------------------
Specifically the 2nd line: 
5503  ioctl(1, TIOCGPGRP, [48])         = -1 ENOTTY (Inappropriate ioctl for 
device)

TCPDUMPS
========
I played with this for all of about 3 minutes. I didn't get far.

RSYNC LOG FILE
==============
2006/05/03 00:28:31 [22725] rsync: writefd_unbuffered failed to write 131 bytes 
[generator]: Connection reset by peer (104)
2006/05/03 00:28:31 [22725] rsync error: error in rsync protocol data stream 
(code 12) at io.c(1119) [generator=2.6.8]

OTHER INFO
==========
------------------------------------------------------------------------------------------
[EMAIL PROTECTED] tmp]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda1             9.7G  3.6G  5.6G  39% /
none                  379M     0  379M   0% /dev/shm
/dev/hda4              17G  675M   15G   5% /home
/dev/hda2             9.7G  360M  8.8G   4% /var
/dev/hdb1             276G  176G   87G  68% /disk2
/dev/hdc1             276G  175G   87G  67% /disk3
/dev/sda1             276G  225G   37G  87% /media/usbdisk
------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------
/etc/rsyncd.conf:
#global paramaters
log file = /var/log/rsyncd.log
read only = no
uid = root

[APACHE]
comment = Backup of APACHE
path = /disk3/backup/0/APACHE
use chroot = yes
auth users = <USERNAME>
secrets file = /etc/rsyncd.secrets
hosts allow = <SUBNET> 127.0.0.1
transfer logging = yes
#log file = /var/log/rsync.APACHE.log
------------------------------------------------------------------------------------------

MY THOUGHTS
===========
It seems that rsync is reaching some random file and then going, nope, I've 
done enough work. Go away. I've tried looking for bad files, but I have found 
nothing.

FROM HERE
=========
I'm just about out of ideas what to do... Returning to the old version doesn't 
seem to fix it. I'm more than happy to provide more information if requested. 
I've put all that I can think into the above email. Let me know if you need 
anything else and I'll provide it.

Many thanks,

Anthony Sadler
Far Edge Technology
w: (02) 8425 1400
 

--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

Reply via email to