Robin Lee Powell wrote:

> On Tue, Dec 15, 2009 at 02:33:06PM +0100, Holger Parplies wrote:
>> Robin Lee Powell wrote on 2009-12-15 00:22:41 -0800:
>>> Oh, I agree; in an ideal world, it wouldn't be an issue. I'm
>>> afraid I don't live there. :)
>>
>> none of us do, but you're having problems. We aren't.
>
> How many of you are backing up trees as large as I am? So far,
> everyone who has commented on the matter has said it's not even
> close.
For what it's worth, I'm certainly not backing up trees as large as
yours (my largest host is approaching 100GB in 1.25 million files,
which can take more than 10 hours), but I do have a 50GB host backing
up over satellite. Barring network outages, my backups work quite
reliably.

>> The suggestion that your *software* is probably misconfigured in
>> addition to the *hardware* being flakey makes a lot of sense to
>> me.
>
> Certainly possible, but if it is, I genuinely have no idea where
> the misconfiguration might be. Also note that only the incrementals
> seem to fail; the initial fulls ran Just Fine (tm). One of them
> took 31 hours.

And I would imagine that data was flowing over the link the whole
time. My guess would be that your firewalls are set up to close
"inactive" TCP sessions. Try adding "-o ServerAliveInterval=60" to
your RsyncClientCmd (so it looks something like "$sshPath -C -q -x
-o ServerAliveInterval=60 -l root $host $rsyncPath $argList+") and
see if that solves your problem. There's a config sketch at the end
of this mail.

> For what it's worth, here's what a client strace says before things
> crack on one of my larger incrementals; commentary welcome.
>
> ------------------
>
> open("[customer dir]", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
> fstat(3, {st_mode=S_IFDIR|0777, st_size=3864, ...}) = 0
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> getdents(3, /* 12 entries */, 4096) = 616
> lstat("[customer dir]/Sapphire_Pearl_Pendant_selector.jpg", {st_mode=S_IFREG|0644, st_size=1363, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_gallery.jpg", {st_mode=S_IFREG|0644, st_size=5482, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_shop_banner.jpg", {st_mode=S_IFREG|0644, st_size=19358, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_library.jpg", {st_mode=S_IFREG|0644, st_size=2749, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_badge.jpg", {st_mode=S_IFREG|0644, st_size=8073, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_browse.jpg", {st_mode=S_IFREG|0644, st_size=2352, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_display.jpg", {st_mode=S_IFREG|0644, st_size=33957, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_segment.jpg", {st_mode=S_IFREG|0644, st_size=1152, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant.JPG", {st_mode=S_IFREG|0644, st_size=88733, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_market_banner.jpg", {st_mode=S_IFREG|0644, st_size=21168, ...}) = 0
> getdents(3, /* 0 entries */, 4096) = 0
> close(3) = 0
> gettimeofday({1260864378, 747386}, NULL) = 0
> gettimeofday({1260864378, 747429}, NULL) = 0
> mmap(NULL, 20398080, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ba2eb836000
> munmap(0x2ba2eb836000, 20398080) = 0
> select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
> write(1,
"\217\2\0\7badge.jpgO/\0\0g\272\341I:3\17shop_banner.jpg\334\332\0\0h\272\341I:3\nbrowse.jpg\10\16\0\0i\272\341I:3\vlibrary.jpg\\\17\0\0h\272\341I:3\fselector.jpg\16\10\0\0j\272\341I:3\21market_banner.jpg\312\346\0\0f\272\341I:2\4.jpgx\255\7\0e\272\341I:2\f_display.jpg\r\235\0\0g\272\341I:3\vsegment.jpgk\6\0\0j\272\341I:$$9642/Banner_-BODIE-Piano_display.jpg\3447\0\0jc\342I\272=\tbadge.jpg\211\30\0\0\272=\vsegment.jpgn\4\0\0\272?\nlector.jpg>\5\0\0:>\16hop_banner.jpg\226d\0\0-$\270J:=\nbrowse.jpgs\10\0\0jc\342I\272=\vlibrary.jpgO\10\0\0\272<\4.jpg\342p\0\0\272<\f_gallery.jpg.\22\0\0\272=\21market_banner.jpg\25\203\0\0:$(2987/sapphire_pearl_pendant_selector.jpgs\5\0\0\344~\341i\...@\vgallery.jpgj\25\0\0:@\17shop_banner.jpg\236k\0\0\343~\341i\...@\vlibrary.jpg\275\n\0\0:@\tbadge.jpg\211\37\0\0\342~\341I:A\trowse.jpg0\t\0\0\344~\341I:@\vdisplay.jpg\245\204\0\0\343~\341I:@\vsegment.jpg\200\4\0\0\344~\341I:?\4.JPG\235Z\1\0\342~\341I\272?\22_market_banner.jpg\260R\0\0\0\0\0\0\0", > 659) = 659 > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > [snip lots of timeouts] > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {1, 212000}) > read(0, "", 8184) = 0 > select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) > write(1, "K\0\0\10rsync: connection unexpectedly closed (179 bytes received > so far) [sender]\n", 79) = -1 EPIPE (Broken pipe) > --- SIGPIPE (Broken pipe) @ 0 (0) --- > The client is seeing the connection unexpectedly closed... > write(2, "rsync: writefd_unbuffered failed to write 79 bytes [sender]: Broken > pipe (32)", 77) = -1 EPIPE (Broken pipe) > --- SIGPIPE (Broken pipe) @ 0 (0) --- > rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 > write(2, "rsync error: errors with program diagnostics (code 13) at > log.c(237) [sender=3.0.5]", 83) = -1 EPIPE (Broken pipe) > --- SIGPIPE (Broken pipe) @ 0 (0) --- > rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 > gettimeofday({1260871590, 977217}, NULL) = 0 > select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) > gettimeofday({1260871591, 76871}, NULL) = 0 > exit_group(13) = ? > > - ------------------ > > And here's the entirety of the xfer log errors: > > - ------------------ > > incr backup started back to 2009-12-12 00:00:08 (backup #0) for directory / > Running: /usr/bin/ssh -p 8416 -i /engineyard/backuppc/.ssh/id_dsa -o > StrictHostKeyChecking=no -q -x -l root 65.74.174.196 /usr/bin/rsync --server > --sender --numeric-ids --perms --owner --group -D --links --times > --block-size=2048 --recursive . 
/ > Xfer PIDs are now 24552 > Got remote protocol 30 > Negotiated protocol version 28 > Checksum seed is 1260871202 > Got checksumSeed 0x4b275e22 > Sent include: /[dir] > Sent include: /[dir]/[dir] > Sent include: /[dir]/[dir]/[dir] > Sent include: /[dir]/[dir]/[dir]/[dir] > Sent exclude: /* > Sent exclude: /[dir]/* > Sent exclude: /[dir]/[dir]/* > Sent exclude: /[dir]/[dir]/[dir]/* > Got file list: 5099022 entries > Child PID is 32756 > Xfer PIDs are now 24552,32756 > Sending csums, cnt = 5099022, phase = 0 > Read EOF: > Tried again: got 0 bytes > Child is aborting > Parent read EOF from child: fatal error! > ...and so is the server. Something in the middle is tearing down the TCP connection. Personally, I like to blame firewalls for these sorts of shenanigans. Sometimes I'm even right. ;o) > Sending csums, cnt = 2251037, phase = 1 > Done: 0 files, 0 bytes > Got fatal error during xfer (Child exited prematurely) > Backup aborted (Child exited prematurely) > > - ------------------ > > I see no sign of system level troubles on either client or server. > Unfortunately, I wasn't running strace on the other large backup, > but it also failed, at a completely different time. > > The really fun part is that the date when the strace exited (was > doing "strace -p NUM ; date") is 6 hours before the BackupPC server > claims that the backup aborted. My ClientTimeout is set to 72000; > both backups aborted significantly *after* the twenty hour mark. > It's not relevant anyways, though; the connection was clearly > broken on the client end long before BackupPC timed out. > > I'm totally willing to accept that the problem might be hardware or > software config on my end, but: > > 1. It seems to only happen with incrementals. > > 2. I have no idea even where to look; everything looks fine at a > system level as I understand it. I don't have the networking skill > to debug the networking end (the two machines are seperate RFC 1918 > address ranges, with a load balancer/firewall associated with each > (2 total) between them, plus a bunch of switches and so on). > > Given that, it seems completely bizarre to me that you all are, I > dunno, morally offended? that I proposed increases BackupPC's > resilience to transient errors as a solution. > IMHO, the implementation of your suggested changes seems complex, and the benefits are not obvious to the general use case. Perhaps your proposed changes would be beneficial (maybe even beyond specific corner cases), but perhaps there are simpler methods of achieving the same goal. Mostly, I think, the perceived offense stems from a steady stream of people who post a couple of emails to the list citing problems they are having with BackupPC and suggesting (often times, half baked) solutions to those problems without consideration to the difficulty of implementing those changes or the impact those changes may have to the reliability of the software as it stands*. For the most part, nothing comes of these complaints, beyond a bit of bluster, resulting in a little less patience for the next suggestion. > In the meantime, I guess I'll go try sharding things and hope it > doesn't overload the client's running production software too much, > because I don't see another way out of this.I > > 'm certainly not interested in maintaining my own patches or fork. > I'd like to think that if I made my idea run-time optional y'all > would roll it in, but the response has been so negative I'm worried. 
The only person I'm aware of who needs to be convinced of the merit
of your changes is the originator of the code, Craig Barratt. Perhaps
there are other maintainers, but they do not make themselves known as
frequently on the list. By far the most prolific posters are fans who
maintain installations and show their support by trying to help
others with theirs. The efficacy of this help varies.

> Also, it's a lot of work. -_-

Indeed.

> -Robin

Chris

* I make no claims as to the applicability of this description to you
or your suggestion, nor do I defend the actions or attitudes of anyone
on this list. I merely state my observations and give my unsolicited
opinion on events as I see fit.
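P.S. Here is a minimal sketch of the keepalive change as it would go
in config.pl (or a per-host override). I'm assuming you use the stock
ssh transport; merge the option with whatever flags you already pass
(your log shows -p 8416, -i, and -o StrictHostKeyChecking=no, for
instance):

    # Send an application-level keepalive every 60 seconds of silence
    # so that stateful firewalls/load balancers don't reap the TCP
    # session as "idle" while the remote rsync is still walking the
    # file list. The string stays single-quoted because BackupPC
    # substitutes $sshPath, $host, $rsyncPath, and $argList+ itself.
    $Conf{RsyncClientCmd} = '$sshPath -C -q -x'
                          . ' -o ServerAliveInterval=60'
                          . ' -l root $host $rsyncPath $argList+';

As a bonus, if the link really does die, ssh gives up after
ServerAliveCountMax (default 3) missed keepalive replies instead of
sitting in select() for hours the way your strace shows. A crude way
to test the idle-timeout theory first is to hold an ssh session open
through the same path with no traffic, e.g.
"ssh -p 8416 -l root 65.74.174.196 'sleep 7200; echo alive'", and see
whether the echo ever comes back.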