On Tue, Dec 15, 2009 at 02:33:06PM +0100, Holger Parplies wrote: > Robin Lee Powell wrote on 2009-12-15 00:22:41 -0800: > > Oh, I agree; in an ideal world, it wouldn't be an issue. I'm > > afraid I don't live there. :) > > none of us do, but you're having problems. We aren't.
How many of you are backing up trees as large as I am? So far, everyone who has commented on the matter has said it's not even close. > The suggestion that your *software* is probably misconfigured in > addition to the *hardware* being flakey makes a lot of sense to > me. Certainly possible, but if it is I genuinely have no idea where the misconfiguration might be. Also note that only the incrementals seem to fail; the initial fulls ran Just Fine (tm). One of them took 31 hours. For what it's worth, here's what a client strace says before things crack on one of my larger incrementals; commentary welcome. - ------------------ open("[customer dir]", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3 fstat(3, {st_mode=S_IFDIR|0777, st_size=3864, ...}) = 0 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 getdents(3, /* 12 entries */, 4096) = 616 lstat("[customer dir]/Sapphire_Pearl_Pendant_selector.jpg", {st_mode=S_IFREG|0644, st_size=1363, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_gallery.jpg", {st_mode=S_IFREG|0644, st_size=5482, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_shop_banner.jpg", {st_mode=S_IFREG|0644, st_size=19358, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_library.jpg", {st_mode=S_IFREG|0644, st_size=2749, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_badge.jpg", {st_mode=S_IFREG|0644, st_size=8073, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_browse.jpg", {st_mode=S_IFREG|0644, st_size=2352, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_display.jpg", {st_mode=S_IFREG|0644, st_size=33957, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_segment.jpg", {st_mode=S_IFREG|0644, st_size=1152, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant.JPG", {st_mode=S_IFREG|0644, st_size=88733, ...}) = 0 lstat("[customer dir]/Sapphire_Pearl_Pendant_market_banner.jpg", {st_mode=S_IFREG|0644, st_size=21168, ...}) = 0 getdents(3, /* 0 entries */, 4096) = 0 close(3) = 0 gettimeofday({1260864378, 747386}, NULL) = 0 gettimeofday({1260864378, 747429}, NULL) = 0 mmap(NULL, 20398080, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ba2eb836000 munmap(0x2ba2eb836000, 20398080) = 0 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\217\2\0\7badge.jpgO/\0\0g\272\341I:3\17shop_banner.jpg\334\332\0\0h\272\341I:3\nbrowse.jpg\10\16\0\0i\272\341I:3\vlibrary.jpg\\\17\0\0h\272\341I:3\fselector.jpg\16\10\0\0j\272\341I:3\21market_banner.jpg\312\346\0\0f\272\341I:2\4.jpgx\255\7\0e\272\341I:2\f_display.jpg\r\235\0\0g\272\341I:3\vsegment.jpgk\6\0\0j\272\341I:$$9642/Banner_-BODIE-Piano_display.jpg\3447\0\0jc\342I\272=\tbadge.jpg\211\30\0\0\272=\vsegment.jpgn\4\0\0\272?\nlector.jpg>\5\0\0:>\16hop_banner.jpg\226d\0\0-$\270J:=\nbrowse.jpgs\10\0\0jc\342I\272=\vlibrary.jpgO\10\0\0\272<\4.jpg\342p\0\0\272<\f_gallery.jpg.\22\0\0\272=\21market_banner.jpg\25\203\0\0:$(2987/sapphire_pearl_pendant_selector.jpgs\5\0\0\344~\341i\...@\vgallery.jpgj\25\0\0:@\17shop_banner.jpg\236k\0\0\343~\341i\...@\vlibrary.jpg\275\n\0\0:@\tbadge.jpg\211\37\0\0\342~\341I:A\trowse.jpg0\t\0\0\344~\341I:@\vdisplay.jpg\245\204\0\0\343~\341I:@\vsegment.jpg\200\4\0\0\344~\341I:?\4.JPG\235Z\1\0\342~\341I\272?\22_market_banner.jpg\260R\0\0\0\0\0\0\0", 659) = 659 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) [snip lots of timeouts] select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {1, 212000}) read(0, "", 8184) = 0 select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) write(1, "K\0\0\10rsync: connection unexpectedly closed (179 bytes received so far) [sender]\n", 79) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- write(2, "rsync: writefd_unbuffered failed to write 79 bytes [sender]: Broken pipe (32)", 77) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 write(2, "rsync error: errors with program diagnostics (code 13) at log.c(237) [sender=3.0.5]", 83) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 gettimeofday({1260871590, 977217}, NULL) = 0 select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) gettimeofday({1260871591, 76871}, NULL) = 0 exit_group(13) = ? - ------------------ And here's the entirety of the xfer log errors: - ------------------ incr backup started back to 2009-12-12 00:00:08 (backup #0) for directory / Running: /usr/bin/ssh -p 8416 -i /engineyard/backuppc/.ssh/id_dsa -o StrictHostKeyChecking=no -q -x -l root 65.74.174.196 /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --times --block-size=2048 --recursive . / Xfer PIDs are now 24552 Got remote protocol 30 Negotiated protocol version 28 Checksum seed is 1260871202 Got checksumSeed 0x4b275e22 Sent include: /[dir] Sent include: /[dir]/[dir] Sent include: /[dir]/[dir]/[dir] Sent include: /[dir]/[dir]/[dir]/[dir] Sent exclude: /* Sent exclude: /[dir]/* Sent exclude: /[dir]/[dir]/* Sent exclude: /[dir]/[dir]/[dir]/* Got file list: 5099022 entries Child PID is 32756 Xfer PIDs are now 24552,32756 Sending csums, cnt = 5099022, phase = 0 Read EOF: Tried again: got 0 bytes Child is aborting Parent read EOF from child: fatal error! Sending csums, cnt = 2251037, phase = 1 Done: 0 files, 0 bytes Got fatal error during xfer (Child exited prematurely) Backup aborted (Child exited prematurely) - ------------------ I see no sign of system level troubles on either client or server. Unfortunately, I wasn't running strace on the other large backup, but it also failed, at a completely different time. The really fun part is that the date when the strace exited (was doing "strace -p NUM ; date") is 6 hours before the BackupPC server claims that the backup aborted. My ClientTimeout is set to 72000; both backups aborted significantly *after* the twenty hour mark. It's not relevant anyways, though; the connection was clearly broken on the client end long before BackupPC timed out. I'm totally willing to accept that the problem might be hardware or software config on my end, but: 1. It seems to only happen with incrementals. 2. I have no idea even where to look; everything looks fine at a system level as I understand it. I don't have the networking skill to debug the networking end (the two machines are seperate RFC 1918 address ranges, with a load balancer/firewall associated with each (2 total) between them, plus a bunch of switches and so on). Given that, it seems completely bizarre to me that you all are, I dunno, morally offended? that I proposed increases BackupPC's resilience to transient errors as a solution. In the meantime, I guess I'll go try sharding things and hope it doesn't overload the client's running production software too much, because I don't see another way out of this.I 'm certainly not interested in maintaining my own patches or fork. I'd like to think that if I made my idea run-time optional y'all would roll it in, but the response has been so negative I'm worried. Also, it's a lot of work. -_- -Robin -- They say: "The first AIs will be built by the military as weapons." And I'm thinking: "Does it even occur to you to try for something other than the default outcome?" See http://shrunklink.com/cdiz http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/ ------------------------------------------------------------------------------ This SF.Net email is sponsored by the Verizon Developer Community Take advantage of Verizon's best-in-class app development support A streamlined, 14 day to market process makes app distribution fast and easy Join now and get one step closer to millions of Verizon customers http://p.sf.net/sfu/verizon-dev2dev _______________________________________________ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List: https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki: http://backuppc.wiki.sourceforge.net Project: http://backuppc.sourceforge.net/