On 12/01/16 08:59, Michael wrote:
> Hi Stephen,
> Hi all,
>
> Thanks for your feedback and for sharing your experience. Here is some clarification on my side.
>
> *** Regarding BPC being not reliable.
>
> I don't deny that BPC works very well in many situations and can sustain heavy load, etc. But in my case it was not the flawless setup I imagined at first. Of course, the main problem is the small amount of memory. Looking in the dmesg logs, I could regularly spot OOM messages, the kernel killing the backuppc_dump process, etc. Now it is a bit unfair of me to blame BPC when the main culprit is actually the lack of memory. But the thing is that BPC is quite unhelpful in these situations. Server logs are mostly useless, with no timestamps, and there is no attempt to restart; instead BPC goes into a long process of counting references, etc., meaning most of the server time is spent on (apparently) unproductive tasks. Again, the main culprit is the platform, and BPC never actually lost any data (afaik) and always recovered somehow. Still, there are some traces of corruption in the db (like warnings about a reference being equal to -1 instead of 0), indicating that maybe BPC is not atomic.

I think adding timestamps to logs is not a significant problem, and it shouldn't be difficult to do. However, which log entries deserve a timestamp? Every single one? Let's assume a timestamp in the format 20160112-094440 ("YYYYMMDD-HHMMSS "); that is just 16 bytes per line, plus some small overhead to look up and format the data. For a log covering 3 million files, that's 48MB of timestamps added to an already massive log file. Maybe we could add a timestamp every minute, or every x minutes? But then the log becomes harder to parse, because the lines are no longer consistent... Maybe start a thread on this specific issue, and let's discuss the options and see what most people think is most useful.
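Purely to illustrate the cost being discussed (this is not BackupPC's logging code; the helper below and its name are made up), a per-line stamp in that format is little more than a strftime() prefix in perl:

    use strict;
    use warnings;
    use POSIX qw(strftime);

    # Hypothetical helper: prefix a log line with a "YYYYMMDD-HHMMSS " stamp,
    # i.e. the 16 extra bytes per line estimated above.
    sub timestamped
    {
        my($line) = @_;
        return strftime("%Y%m%d-%H%M%S ", localtime()) . $line;
    }

    print timestamped("full backup started for host example\n");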
> *** Regarding the 256MB requirement.
>
> Admittedly this is very demanding and very far off most of the feedback I've seen on the net. Stephen's setup of a recycled Dell PE 1950 (8 cores, 16 GB RAM) seems more typical than my poor Lacie Cloudbox with 256MB. But when I monitor memory usage, for instance when doing a full backup of a 600k-file client, the BPC dump/rsync processes consistently consume around 100MB of memory (htop/smem). Looking at the rsync page, they say that rsync 3.0+ should consume 100 bytes/file, so a total of 60MB for rsync. So I don't see any blocking point why BPC would not fit in a 256MB memory budget. Of course it will be slower, but it must work. This weekend I again spent some time tuning down the Lacie Cloudbox, stripping away all useless processes, like those hungry Python things. Now, when idle, the Lacie has 200MB of free physical memory + 200MB of free swap, and in that setup BPC worked for 2 days without a crash, doing a full 55GB, 650k-file backup in 2 hours at 7.5MB/s (almost no changes, hence the very high speed, of course). For now I have disabled all my machines but a few, and will enable the remaining ones one by one. I have good hope that it will work again. My wish now would be to restore some other services, but this will likely require increasing the swap space.

I don't think the main development of BPC should be built around what is essentially an embedded platform. However, if there is some random piece of BPC which is allocating memory where it isn't needed, then that can definitely be looked at. So far, what I have seen BPC fail on is a single directory with a large number of files (977386 currently, which has not succeeded in some time; unfortunately, the application requires all files to be in a single directory). That said, scaling the requirements with the hardware is not a bad goal, and even better if minimal hardware can back up any target, with the only effect being a longer time to complete.

Do we *really* need the entire list of files in RAM? Isn't that the point of the newer rsync, which doesn't need to pre-load the entire file list? Can't we just process one file at a time, with a look-ahead of 1000 files (which seems to be what rsync does already)? Something like the sketch below. I expect this type of work to be a lot more complicated/involved, and don't expect a lot of help, as few people are going to be interested in this goal. Developers tend to scratch their own itches...
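To make the bounded look-ahead idea concrete, here is a rough sketch under the assumption that the file list can be streamed in one name per line; everything here, including the window size and the process_file() stub, is invented for illustration and is not BackupPC or rsync code:

    use strict;
    use warnings;

    # Keep at most 1000 file names in memory at once, instead of loading
    # the entire file list up front.
    my $WINDOW = 1000;
    my @lookahead;

    sub process_file {
        my($name) = @_;
        # ... compare attributes, fetch deltas, update the pool, etc. ...
        print "processed $name\n";
    }

    while (my $name = <STDIN>) {    # file names streamed in, one per line
        chomp $name;
        push @lookahead, $name;
        process_file(shift @lookahead) if @lookahead >= $WINDOW;
    }
    process_file(shift @lookahead) while @lookahead;   # drain the window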
> *** Regarding BPC being slow
>
> I only give BPC 256MB, so I shouldn't expect too much regarding performance. That I fully agree with. However, when I say it is slow, I mean it is slow even taking that fact into account. Transfer speed is ok-ish; it uses rsync at its best, which sometimes requires some heavy processing. But I don't understand why it needs so much time for the remaining tasks (ref counting, etc). I'm actually convinced (perhaps naively) that this can be significantly improved. See further down.

I agree; I think v4 is doing refcounts that are not needed. I saw an email recently (in the last few months) noting that v4 will refcount *all* backups for the host, instead of only the backup that was just completed. The current host I'm working on takes hours just for the refcnt after a backup, and this likely involves a LOT of random I/O as well.

> *** Regarding "trashing" rsync + @kosowsky.org about designing a totally new backup program
>
> My statement was... too brutal I guess ;-) I fully agree with Stephen's comment, and I don't want to create a new program from scratch. rsync is one of the best open-source sync programs; Unison and duplicity basically use rsync internally. I do think, however, that BPC can be significantly improved.

I think what I'd like to see here is the ability to add some "intelligence" to the client side. Whether we need a BPC client or not, I'm not sure, but currently BPC doesn't seem to "continue" a backup well. Some cloud sync apps seem better at doing small incremental uploads which eventually add up to a consistent, confirmed backup. This includes backing up really large single files: BPC will discard and re-start the file, instead of knowing that "this" is only half the file and that it should continue on the next backup.

1) Consider a file 1GB in size. The first time BPC sees the file, it starts a full transfer and manages to download 300MB before a network hiccup or timeout happens. BPC could add the file to the pool and save it into the partial backup, marking the file as incomplete. And if a disaster strikes, isn't it better to recover the first 300MB of the file rather than nothing?

2) Now suppose BPC has a complete/valid backup of this 1GB file, but it has changed. BPC/rsync starts to transfer the file, and we complete the changes in the first 300MB before the same network hiccup/timeout. Again, why not keep the file and mark it as incomplete? Next time, rsync will quickly skip the first 300MB and continue the backup of the rest of the file. In a disaster, you have the choice to restore the incomplete file from the partial backup, or the complete file from the previous backup, or both, and then forensically examine the differences to potentially recover a bunch of data you may not otherwise have had access to.

> - Flatten the backup hierarchy
>
> Initially BPC was a "mere" wrapper around rsync: first duplicating a complete hierarchy with hard links, then rsync'ing over it. It had the advantage of simplicity, but is very slow to maintain and impossible to duplicate. Now the trend in 4.0alpha is to move to a custom C implementation of rsync, where the hierarchy only stores attrib files. I think that we can improve the maintenance phase further (ref counting, backup deletion...) by flattening this structure into a single linear file, and by listing once and for all the references in a given backup, possibly with caching of references per directory. Directory entries would be more like git objects, attaching a name to a reference along with some metadata. This means integrating further with the inner workings of rsync. It would be fully compliant with rsync from the client side. But refcounting and backup deletion would then be equivalent to sorting and finding duplicate/unique entries, which can be very efficient. Even on my Lacie, sorting a 600k-line file with 32B random hash entries takes only a couple of seconds.

Wouldn't that require loading all 600k lines into memory? What if you had 100 million entries, or 100 billion? I think by the time you get to looking at that, you are better off using a proper DB to store that data; they are much better designed to handle sorting/random access of data than some flat text file. This might be something better looked at for BPC v5, as it's likely to be a fairly large architectural change. I'd need to read a lot more about the v4-specific on-disk formats to comment further...
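For what the "refcounting as sorting and finding duplicates" idea might look like in its simplest form, here is a sketch that counts how often each digest appears in a flat per-backup reference list. The file name and one-digest-per-line layout are invented for illustration and are not the v4 on-disk format, and as noted above a hash like this still holds everything in memory, so very large lists would need an external sort or a real DB:

    use strict;
    use warnings;

    # Count references per content digest, one hex digest per line
    # (invented flat-file format, fine for ~600k entries).
    my %refcnt;
    open(my $fh, '<', 'backup.refs') or die "backup.refs: $!";
    while (my $digest = <$fh>) {
        chomp $digest;
        $refcnt{$digest}++;
    }
    close($fh);

    # A count of 1 means the digest is unique to this backup; higher
    # counts mean the pool file is shared.
    for my $digest (sort keys %refcnt) {
        print "$digest $refcnt{$digest}\n";
    }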
> - Client-side sync
>
> Sure, this must be an optional feature, and I agree this is not the priority. Many clients will still simply run rsyncd or rsync/ssh. But client-side sync would allow hard links to be detected more efficiently. It would also decrease memory usage on the server (see the rsync FAQ). Then it opens up a whole new set of optimizations: delta-diff on multiple files...

Yes, and it would mostly work as simply "another" BPC protocol that can sit alongside tar/rsync/smb/etc... However, finding the developers to work on this, and then to maintain it in the long term? A *nix client may not be so difficult, but a Windows client might be more useful and harder... A definite project all by itself!

> *** Regarding writing in C
>
> Ok, I'm not a perl fan. But I agree, it is useful for stuff where performance does not matter, for the website interface, etc. But I would rewrite the ref counting part and similar in C.

I suppose the question is how much performance improvement this will get you. It is possible to embed C within perl, and possible to pre-compile a perl script into a standalone executable, so certainly re-writing sections in C is not impossible.
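As a minimal example of the "embed C within perl" route, using the Inline::C module from CPAN (the function below is a trivial placeholder, not anything from the BPC code base):

    use strict;
    use warnings;

    # Inline::C compiles the C snippet on first run and binds it as a perl sub.
    use Inline C => q{
        int add_ints(int a, int b) {
            return a + b;
        }
    };

    print add_ints(40, 2), "\n";   # prints 42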
For example, say you want to re-write the ref counting part. I suspect this is mostly disk I/O constrained rather than CPU/code constrained, so I doubt you would see any real performance improvement. I expect the best way to improve performance of this part is to improve/fix the algorithm, and then translate that improvement into the code. E.g., as was reported (by someone else whose name I can't recall right now), if you have 100 backups saved for a host and you finish a new backup (whether completed or partial), then you redo the refcnt for all 101 backups. If this were changed to only redo the refcnt for the current backup, then you are 100 times faster. Better than any improvement from changing the language.

Finally, I wonder whether we will have more (or fewer) people able to contribute code (and actually doing it) if it is written in perl or in C? I suspect the pragmatic approach will be to keep it in perl and just patch the things needed. Over time, some performance-critical components could be re-written in C and embedded into the existing perl system. Eventually, the final step could be taken to convert the remaining portions to C.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

_______________________________________________
BackupPC-devel mailing list
BackupPC-devel@lists.sourceforge.net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-devel
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/