On 12/01/16 08:59, Michael wrote:
> Hi Stephen,
> Hi all,
>
> Thanks for your feedback and for sharing your experience. Here is some clarification on my side.
>
> *** Regarding BPC being not reliable.
>
> I don't deny that BPC works very well in many situations and can sustain heavy load, etc. But in my case it was not the flawless setup I imagined at first. Of course, the main problem is the small amount of memory. Looking in the dmesg logs, I could regularly spot OOM messages, the kernel killing the backuppc_dump process, etc. Now it is a bit unfair of me to blame BPC when the main culprit is actually the lack of memory. But the thing is that BPC is quite unhelpful in these situations. Server logs are mostly useless, with no timestamps, and there is no attempt to restart; instead BPC goes into a long process of counting references, etc., meaning most of the server time is spent on (apparently) unproductive tasks. Again, the main culprit is the platform, and BPC never actually lost any data (afaik) and always recovered somehow. Still, there are some traces of corruption in the db (like warnings about a reference being equal to -1 instead of 0), indicating that maybe BPC is not atomic.

I think adding timestamps to logs is not a significant problem, and it shouldn't be difficult to do. However, which log entries deserve a timestamp? Every single one? Let's assume a timestamp in the format 20160112-094440 ("YYYYMMDD-HHMMSS "); that is just 16 bytes per line, plus some small overhead to look up and format the data. For a log covering 3 million files, that's 48MB of timestamps added to an already massive log file. Maybe we could add a timestamp every minute, or every x minutes? But then the log becomes harder to parse, because the lines are no longer consistent... Maybe start a thread on this specific issue, and let's discuss the options and see what most people think is most useful.
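Purely to illustrate the cost being discussed (this is not BackupPC's logging code; the helper below and its name are made up), a per-line stamp in that format is little more than a strftime() prefix in perl:

    use strict;
    use warnings;
    use POSIX qw(strftime);

    # Hypothetical helper: prefix a log line with a "YYYYMMDD-HHMMSS " stamp,
    # i.e. the 16 extra bytes per line estimated above.
    sub timestamped
    {
        my($line) = @_;
        return strftime("%Y%m%d-%H%M%S ", localtime()) . $line;
    }

    print timestamped("full backup started for host example\n");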
> *** Regarding the 256MB requirement.
>
> Admittedly this is very demanding and very far off most of the feedback I've seen on the net. Stephen's setup of a recycled Dell PE 1950 (8 cores, 16 GB RAM) seems more typical than my poor Lacie Cloudbox with 256MB. But when I monitor memory usage, for instance when doing a full backup of a 600k-file client, the BPC dump/rsync processes consistently consume around 100MB of memory (htop/smem). Looking at the rsync page, they say that rsync 3.0+ should consume 100 bytes/file, so a total of 60MB for rsync. So I don't see any blocking point why BPC would not fit in a 256MB memory budget. Of course it will be slower, but it must work. This weekend I again spent some time tuning down the Lacie Cloudbox, stripping away all useless processes, like those hungry Python things. Now, when idle, the Lacie has 200MB of free physical memory + 200MB of free swap, and in that setup BPC worked for 2 days without a crash, doing a full 55GB, 650k-file backup in 2 hours at 7.5MB/s (almost no changes, hence the very high speed, of course). For now I have disabled all my machines but a few, and will enable the remaining ones one by one. I have good hope that it will work again. My wish now would be to restore some other services, but this will likely require increasing the swap space.

I don't think the main development of BPC should be built around what is essentially an embedded platform. However, if there is some random piece of BPC which is allocating memory where it isn't needed, then that can definitely be looked at. So far, what I have seen BPC fail on is a single directory with a large number of files (977386 currently, which has not succeeded in some time; unfortunately, the application requires all files to be in a single directory). That said, scaling the requirements with the hardware is not a bad goal, and even better if minimal hardware can back up any target, with the only effect being a longer time to complete.

Do we *really* need the entire list of files in RAM? Isn't that the point of the newer rsync, which doesn't need to pre-load the entire file list? Can't we just process one file at a time, with a look-ahead of 1000 files (which seems to be what rsync does already)? Something like the sketch below. I expect this type of work to be a lot more complicated/involved, and don't expect a lot of help, as few people are going to be interested in this goal. Developers tend to scratch their own itches...
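To make the bounded look-ahead idea concrete, here is a rough sketch under the assumption that the file list can be streamed in one name per line; everything here, including the window size and the process_file() stub, is invented for illustration and is not BackupPC or rsync code:

    use strict;
    use warnings;

    # Keep at most 1000 file names in memory at once, instead of loading
    # the entire file list up front.
    my $WINDOW = 1000;
    my @lookahead;

    sub process_file {
        my($name) = @_;
        # ... compare attributes, fetch deltas, update the pool, etc. ...
        print "processed $name\n";
    }

    while (my $name = <STDIN>) {    # file names streamed in, one per line
        chomp $name;
        push @lookahead, $name;
        process_file(shift @lookahead) if @lookahead >= $WINDOW;
    }
    process_file(shift @lookahead) while @lookahead;   # drain the window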
> *** Regarding BPC being slow
>
> I only give BPC 256MB, so I shouldn't expect too much regarding performance. That I fully agree with. However, when I say it is slow, I mean it is slow even taking that fact into account. Transfer speed is ok-ish; it uses rsync at its best, which sometimes requires some heavy processing. But I don't understand why it needs so much time for the remaining tasks (ref counting, etc). I'm actually convinced (perhaps naively) that this can be significantly improved. See further down.

I agree; I think v4 is doing refcounts that are not needed. I saw an email recently (in the last few months) noting that v4 will refcount *all* backups for the host, instead of only the backup that was just completed. The current host I'm working on takes hours just for the refcnt after a backup, and this likely involves a LOT of random I/O as well.

> *** Regarding "trashing" rsync + @kosowsky.org about designing a totally new backup program
>
> My statement was... too brutal I guess ;-) I fully agree with Stephen's comment, and I don't want to create a new program from scratch. rsync is one of the best open-source sync programs; Unison and duplicity basically use rsync internally. I do think, however, that BPC can be significantly improved.

I think what I'd like to see here is the ability to add some "intelligence" to the client side. Whether we need a BPC client or not, I'm not sure, but currently BPC doesn't seem to "continue" a backup well. Some cloud sync apps seem better at doing small incremental uploads which eventually add up to a consistent, confirmed backup. This includes backing up really large single files: BPC will discard and re-start the file, instead of knowing that "this" is only half the file and that it should continue on the next backup.

1) Consider a file 1GB in size. The first time BPC sees the file, it starts a full transfer and manages to download 300MB before a network hiccup or timeout happens. BPC could add the file to the pool and save it into the partial backup, marking the file as incomplete. And if a disaster strikes, isn't it better to recover the first 300MB of the file rather than nothing?

2) Now suppose BPC has a complete/valid backup of this 1GB file, but it has changed. BPC/rsync starts to transfer the file, and we complete the changes in the first 300MB before the same network hiccup/timeout. Again, why not keep the file and mark it as incomplete? Next time, rsync will quickly skip the first 300MB and continue the backup of the rest of the file. In a disaster, you have the choice to restore the incomplete file from the partial backup, or the complete file from the previous backup, or both, and then forensically examine the differences to potentially recover a bunch of data you may not otherwise have had access to.

> - Flatten the backup hierarchy
>
> Initially BPC was a "mere" wrapper around rsync: first duplicating a complete hierarchy with hard links, then rsync'ing over it. It had the advantage of simplicity, but is very slow to maintain and impossible to duplicate. Now the trend in 4.0alpha is to move to a custom C implementation of rsync, where the hierarchy only stores attrib files. I think that we can improve the maintenance phase further (ref counting, backup deletion...) by flattening this structure into a single linear file, and by listing once and for all the references in a given backup, possibly with caching of references per directory. Directory entries would be more like git objects, attaching a name to a reference along with some metadata. This means integrating further with the inner workings of rsync. It would be fully compliant with rsync from the client side. But refcounting and backup deletion would then be equivalent to sorting and finding duplicate/unique entries, which can be very efficient. Even on my Lacie, sorting a 600k-line file with 32B random hash entries takes only a couple of seconds.

Wouldn't that require loading all 600k lines into memory? What if you had 100 million entries, or 100 billion? I think by the time you get to looking at that, you are better off using a proper DB to store that data; they are much better designed to handle sorting/random access of data than some flat text file. This might be something better looked at for BPC v5, as it's likely to be a fairly large architectural change. I'd need to read a lot more about the v4-specific on-disk formats to comment further...
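For what the "refcounting as sorting and finding duplicates" idea might look like in its simplest form, here is a sketch that counts how often each digest appears in a flat per-backup reference list. The file name and one-digest-per-line layout are invented for illustration and are not the v4 on-disk format, and as noted above a hash like this still holds everything in memory, so very large lists would need an external sort or a real DB:

    use strict;
    use warnings;

    # Count references per content digest, one hex digest per line
    # (invented flat-file format, fine for ~600k entries).
    my %refcnt;
    open(my $fh, '<', 'backup.refs') or die "backup.refs: $!";
    while (my $digest = <$fh>) {
        chomp $digest;
        $refcnt{$digest}++;
    }
    close($fh);

    # A count of 1 means the digest is unique to this backup; higher
    # counts mean the pool file is shared.
    for my $digest (sort keys %refcnt) {
        print "$digest $refcnt{$digest}\n";
    }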
> - Client-side sync
>
> Sure, this must be an optional feature, and I agree this is not the priority. Many clients will still simply run rsyncd or rsync/ssh. But client-side sync would allow hard links to be detected more efficiently. It would also decrease memory usage on the server (see the rsync FAQ). Then it opens up a whole new set of optimizations: delta-diff on multiple files...

Yes, and it would mostly work as simply "another" BPC protocol that can sit alongside tar/rsync/smb/etc... However, finding the developers to work on this, and then to maintain it in the long term? A *nix client may not be so difficult, but a Windows client might be more useful and harder... A definite project all by itself!

> *** Regarding writing in C
>
> Ok, I'm not a perl fan. But I agree, it is useful for stuff where performance does not matter, for the website interface, etc. But I would rewrite the ref counting part and similar in C.

I suppose the question is how much performance improvement this will get you. It is possible to embed C within perl, and possible to pre-compile a perl script into a standalone executable, so certainly re-writing sections in C is not impossible.
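As a minimal example of the "embed C within perl" route, using the Inline::C module from CPAN (the function below is a trivial placeholder, not anything from the BPC code base):

    use strict;
    use warnings;

    # Inline::C compiles the C snippet on first run and binds it as a perl sub.
    use Inline C => q{
        int add_ints(int a, int b) {
            return a + b;
        }
    };

    print add_ints(40, 2), "\n";   # prints 42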
For example, say you want to re-write the ref counting part. I suspect this is mostly disk I/O constrained rather than CPU/code constrained, so I doubt you would see any real performance improvement. I expect the best way to improve performance of this part is to improve/fix the algorithm, and then translate that improvement into the code. E.g., as was reported (by someone else whose name I can't recall right now), if you have 100 backups saved for a host and you finish a new backup (whether completed or partial), then you redo the refcnt for all 101 backups. If this were changed to only redo the refcnt for the current backup, then you are 100 times faster. Better than any improvement from changing the language.

Finally, I wonder whether we will have more (or fewer) people able to contribute code (and actually doing it) if it is written in perl or in C? I suspect the pragmatic approach will be to keep it in perl and just patch the things needed. Over time, some performance-critical components could be re-written in C and embedded into the existing perl system. Eventually, the final step could be taken to convert the remaining portions to C.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

_______________________________________________
BackupPC-devel mailing list
BackupPC-devel@lists.sourceforge.net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-devel
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/