Robin Lee Powell wrote:

> On Tue, Dec 15, 2009 at 02:33:06PM +0100, Holger Parplies wrote:
>> Robin Lee Powell wrote on 2009-12-15 00:22:41 -0800:
>>> Oh, I agree; in an ideal world, it wouldn't be an issue. I'm
>>> afraid I don't live there. :)
>>
>> none of us do, but you're having problems. We aren't.
>
> How many of you are backing up trees as large as I am? So far,
> everyone who has commented on the matter has said it's not even
> close.
For what it's worth, I'm certainly not backing up trees as large as
yours (my largest host is approaching 100GB in 1.25 million files,
which can take more than 10 hours), but I do have a 50GB host backing
up over satellite. Barring network outages, my backups work quite
reliably.

>> The suggestion that your *software* is probably misconfigured in
>> addition to the *hardware* being flakey makes a lot of sense to
>> me.
>
> Certainly possible, but if it is, I genuinely have no idea where
> the misconfiguration might be. Also note that only the incrementals
> seem to fail; the initial fulls ran Just Fine (tm). One of them
> took 31 hours.

And I would imagine that data was flowing over the link the whole
time. My guess would be that your firewalls are set up to close
"inactive" TCP sessions. Try adding "-o ServerAliveInterval=60" to
your RsyncClientCmd (so it looks something like "$sshPath -C -q -x
-o ServerAliveInterval=60 -l root $host $rsyncPath $argList+") and
see if that solves your problem. There's a config sketch at the end
of this mail.

> For what it's worth, here's what a client strace says before things
> crack on one of my larger incrementals; commentary welcome.
>
> ------------------
>
> open("[customer dir]", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
> fstat(3, {st_mode=S_IFDIR|0777, st_size=3864, ...}) = 0
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> getdents(3, /* 12 entries */, 4096) = 616
> lstat("[customer dir]/Sapphire_Pearl_Pendant_selector.jpg", {st_mode=S_IFREG|0644, st_size=1363, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_gallery.jpg", {st_mode=S_IFREG|0644, st_size=5482, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_shop_banner.jpg", {st_mode=S_IFREG|0644, st_size=19358, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_library.jpg", {st_mode=S_IFREG|0644, st_size=2749, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_badge.jpg", {st_mode=S_IFREG|0644, st_size=8073, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_browse.jpg", {st_mode=S_IFREG|0644, st_size=2352, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_display.jpg", {st_mode=S_IFREG|0644, st_size=33957, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_segment.jpg", {st_mode=S_IFREG|0644, st_size=1152, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant.JPG", {st_mode=S_IFREG|0644, st_size=88733, ...}) = 0
> lstat("[customer dir]/Sapphire_Pearl_Pendant_market_banner.jpg", {st_mode=S_IFREG|0644, st_size=21168, ...}) = 0
> getdents(3, /* 0 entries */, 4096) = 0
> close(3) = 0
> gettimeofday({1260864378, 747386}, NULL) = 0
> gettimeofday({1260864378, 747429}, NULL) = 0
> mmap(NULL, 20398080, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ba2eb836000
> munmap(0x2ba2eb836000, 20398080) = 0
> select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0})
> write(1,
"\217\2\0\7badge.jpgO/\0\0g\272\341I:3\17shop_banner.jpg\334\332\0\0h\272\341I:3\nbrowse.jpg\10\16\0\0i\272\341I:3\vlibrary.jpg\\\17\0\0h\272\341I:3\fselector.jpg\16\10\0\0j\272\341I:3\21market_banner.jpg\312\346\0\0f\272\341I:2\4.jpgx\255\7\0e\272\341I:2\f_display.jpg\r\235\0\0g\272\341I:3\vsegment.jpgk\6\0\0j\272\341I:$$9642/Banner_-BODIE-Piano_display.jpg\3447\0\0jc\342I\272=\tbadge.jpg\211\30\0\0\272=\vsegment.jpgn\4\0\0\272?\nlector.jpg>\5\0\0:>\16hop_banner.jpg\226d\0\0-$\270J:=\nbrowse.jpgs\10\0\0jc\342I\272=\vlibrary.jpgO\10\0\0\272<\4.jpg\342p\0\0\272<\f_gallery.jpg.\22\0\0\272=\21market_banner.jpg\25\203\0\0:$(2987/sapphire_pearl_pendant_selector.jpgs\5\0\0\344~\341i\...@\vgallery.jpgj\25\0\0:@\17shop_banner.jpg\236k\0\0\343~\341i\...@\vlibrary.jpg\275\n\0\0:@\tbadge.jpg\211\37\0\0\342~\341I:A\trowse.jpg0\t\0\0\344~\341I:@\vdisplay.jpg\245\204\0\0\343~\341I:@\vsegment.jpg\200\4\0\0\344~\341I:?\4.JPG\235Z\1\0\342~\341I\272?\22_market_banner.jpg\260R\0\0\0\0\0\0\0", > 659) = 659 > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > [snip lots of timeouts] > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {1, 212000}) > read(0, "", 8184) = 0 > select(2, NULL, [1], [1], {60, 0}) = 1 (out [1], left {60, 0}) > write(1, "K\0\0\10rsync: connection unexpectedly closed (179 bytes received > so far) [sender]\n", 79) = -1 EPIPE (Broken pipe) > --- SIGPIPE (Broken pipe) @ 0 (0) --- > The client is seeing the connection unexpectedly closed... > write(2, "rsync: writefd_unbuffered failed to write 79 bytes [sender]: Broken > pipe (32)", 77) = -1 EPIPE (Broken pipe) > --- SIGPIPE (Broken pipe) @ 0 (0) --- > rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 > write(2, "rsync error: errors with program diagnostics (code 13) at > log.c(237) [sender=3.0.5]", 83) = -1 EPIPE (Broken pipe) > --- SIGPIPE (Broken pipe) @ 0 (0) --- > rt_sigaction(SIGUSR1, {SIG_IGN}, NULL, 8) = 0 > rt_sigaction(SIGUSR2, {SIG_IGN}, NULL, 8) = 0 > gettimeofday({1260871590, 977217}, NULL) = 0 > select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout) > gettimeofday({1260871591, 76871}, NULL) = 0 > exit_group(13) = ? > > - ------------------ > > And here's the entirety of the xfer log errors: > > - ------------------ > > incr backup started back to 2009-12-12 00:00:08 (backup #0) for directory / > Running: /usr/bin/ssh -p 8416 -i /engineyard/backuppc/.ssh/id_dsa -o > StrictHostKeyChecking=no -q -x -l root 65.74.174.196 /usr/bin/rsync --server > --sender --numeric-ids --perms --owner --group -D --links --times > --block-size=2048 --recursive . 
/ > Xfer PIDs are now 24552 > Got remote protocol 30 > Negotiated protocol version 28 > Checksum seed is 1260871202 > Got checksumSeed 0x4b275e22 > Sent include: /[dir] > Sent include: /[dir]/[dir] > Sent include: /[dir]/[dir]/[dir] > Sent include: /[dir]/[dir]/[dir]/[dir] > Sent exclude: /* > Sent exclude: /[dir]/* > Sent exclude: /[dir]/[dir]/* > Sent exclude: /[dir]/[dir]/[dir]/* > Got file list: 5099022 entries > Child PID is 32756 > Xfer PIDs are now 24552,32756 > Sending csums, cnt = 5099022, phase = 0 > Read EOF: > Tried again: got 0 bytes > Child is aborting > Parent read EOF from child: fatal error! > ...and so is the server. Something in the middle is tearing down the TCP connection. Personally, I like to blame firewalls for these sorts of shenanigans. Sometimes I'm even right. ;o) > Sending csums, cnt = 2251037, phase = 1 > Done: 0 files, 0 bytes > Got fatal error during xfer (Child exited prematurely) > Backup aborted (Child exited prematurely) > > - ------------------ > > I see no sign of system level troubles on either client or server. > Unfortunately, I wasn't running strace on the other large backup, > but it also failed, at a completely different time. > > The really fun part is that the date when the strace exited (was > doing "strace -p NUM ; date") is 6 hours before the BackupPC server > claims that the backup aborted. My ClientTimeout is set to 72000; > both backups aborted significantly *after* the twenty hour mark. > It's not relevant anyways, though; the connection was clearly > broken on the client end long before BackupPC timed out. > > I'm totally willing to accept that the problem might be hardware or > software config on my end, but: > > 1. It seems to only happen with incrementals. > > 2. I have no idea even where to look; everything looks fine at a > system level as I understand it. I don't have the networking skill > to debug the networking end (the two machines are seperate RFC 1918 > address ranges, with a load balancer/firewall associated with each > (2 total) between them, plus a bunch of switches and so on). > > Given that, it seems completely bizarre to me that you all are, I > dunno, morally offended? that I proposed increases BackupPC's > resilience to transient errors as a solution. > IMHO, the implementation of your suggested changes seems complex, and the benefits are not obvious to the general use case. Perhaps your proposed changes would be beneficial (maybe even beyond specific corner cases), but perhaps there are simpler methods of achieving the same goal. Mostly, I think, the perceived offense stems from a steady stream of people who post a couple of emails to the list citing problems they are having with BackupPC and suggesting (often times, half baked) solutions to those problems without consideration to the difficulty of implementing those changes or the impact those changes may have to the reliability of the software as it stands*. For the most part, nothing comes of these complaints, beyond a bit of bluster, resulting in a little less patience for the next suggestion. > In the meantime, I guess I'll go try sharding things and hope it > doesn't overload the client's running production software too much, > because I don't see another way out of this.I > > 'm certainly not interested in maintaining my own patches or fork. > I'd like to think that if I made my idea run-time optional y'all > would roll it in, but the response has been so negative I'm worried. 
The only person I'm aware of who needs to be convinced of the merit
of your changes is the originator of the code, Craig Barratt. Perhaps
there are other maintainers, but they do not make themselves known as
frequently on the list. By far the most prolific posters are fans who
maintain installations and show their support by trying to help
others with theirs. The efficacy of this help varies.

> Also, it's a lot of work. -_-

Indeed.

> -Robin

Chris

* I make no claims as to the applicability of this description to you
or your suggestion, nor do I defend the actions or attitudes of anyone
on this list. I merely state my observations and give my unsolicited
opinion on events as I see fit.
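P.S. Here is a minimal sketch of the keepalive change as it would go
in config.pl (or a per-host override). I'm assuming you use the stock
ssh transport; merge the option with whatever flags you already pass
(your log shows -p 8416, -i, and -o StrictHostKeyChecking=no, for
instance):

    # Send an application-level keepalive every 60 seconds of silence
    # so that stateful firewalls/load balancers don't reap the TCP
    # session as "idle" while the remote rsync is still walking the
    # file list. The string stays single-quoted because BackupPC
    # substitutes $sshPath, $host, $rsyncPath, and $argList+ itself.
    $Conf{RsyncClientCmd} = '$sshPath -C -q -x'
                          . ' -o ServerAliveInterval=60'
                          . ' -l root $host $rsyncPath $argList+';

As a bonus, if the link really does die, ssh gives up after
ServerAliveCountMax (default 3) missed keepalive replies instead of
sitting in select() for hours the way your strace shows. A crude way
to test the idle-timeout theory first is to hold an ssh session open
through the same path with no traffic, e.g.
"ssh -p 8416 -l root 65.74.174.196 'sleep 7200; echo alive'", and see
whether the echo ever comes back.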