Hi Craig & Jeffrey,

On Wed, Mar 02, 2011 at 09:17:26PM -0500, Jeffrey J. Kosowsky wrote:
> > However, the benefits significantly outweigh the drawbacks:
> >
> > - eliminating hardlinks means the backup storage is much easier
> >   to replicate, copy or restore.
>
> TOTALLY AWESOME. This should reduce the traffic on the BackupPC
> newslist by about 25% just by eliminating this FAQ and complaint.

I'm looking forward to being able to back up the pool! :-) And maybe
I'll start a new install for 4.0. Hm... maybe a pool conversion tool
wouldn't be too difficult?

> > - determining which pool files can be deleted is much more
> >   efficient, since only the reference count database needs
> >   to be searched for reference counts of 0. It is no longer
> >   necessary to stat() every file in the pool, which is very
> >   time consuming on large pools.
>
> Love these improvements in performance.

> > It is not necessary to update the reference counts in real time,
> > so the implementation is a lot simpler and more efficient. In
> > fact, the reference count updating is done as part of the
> > BackupPC_nightly process.
> >
> > The reference count database is stored in 128 different files,
> > based on the first byte of the digest ANDed with 0xfe. Therefore
> > the file:
> >
> >     CPOOL_DIR/4e/poolCnt
> >
> > stores all the reference counts for digests that start with 0x4e
> > or 0x4f. The file itself is the result of using Storable::store()
> > on a hash whose key is the digest and value is the reference
> > count. This is a compact format for storing the Perl data
> > structure. The entire file is read or written in a batch-like
> > manner - it is not intended for dynamic updates of individual
> > entries.
>
> Why use only 7 bits (AND with 0xfe) rather than 8 bits (AND with
> 0xff)?

I wondered the same. I just took a look - my pool has 9 million files
(and I'm sure there are significantly larger pools in the wild), so
9 million / 128 is roughly 70,000 entries per file. The digest is
32 bytes, IIRC, and the refcount is 8 bytes, so we've got roughly
42 bytes/entry ;-) which means almost 3 MB per file. We're talking
about almost 380 MB of refcounts for the 9 million files!
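To make the on-disk scheme concrete, here's a minimal Perl sketch of
the digest-to-file mapping and the batch read/update/write cycle, as
I understand the description above. It's only an illustration, not
actual 4.x code - the pool path and subroutine names are made up:

  use strict;
  use warnings;
  use Storable qw(retrieve store);

  my $CPoolDir = "/var/lib/backuppc/cpool";   # assumed location

  # First digest byte ANDed with 0xfe: digests starting with 0x4e
  # and 0x4f both map to cpool/4e/poolCnt, giving 128 count files.
  sub poolCntFile {
      my ($digest) = @_;                      # binary digest
      my $dir = sprintf("%02x", ord($digest) & 0xfe);
      return "$CPoolDir/$dir/poolCnt";
  }

  # Batch update: read the whole hash, apply deltas, write it back.
  # Entries that drop to 0 are kept, so that BackupPC_nightly can
  # find deletable pool files by searching for refcount 0.
  sub applyDeltas {
      my ($file, $deltas) = @_;               # { digest => +/-n }
      my $cnt = -f $file ? retrieve($file) : {};
      $cnt->{$_} += $deltas->{$_} for keys %$deltas;
      store($cnt, $file);                     # rewritten in one go
  }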
> > When backups are done or backups are deleted, a file is created
> > that records the changes in reference counts. For example, if a
> > backup is being done, and a new file is matched to an existing
> > pool file, then the reference count for that pool file needs to
> > be incremented. Similarly, if a backup is deleted so that a given
> > pool file is no longer referenced, then that reference count
> > needs to be decremented. Remember that backups are stored as
> > reverse-time deltas in 4.x, so there are a few subtle issues
> > about how reference counts change. For example, if a file was
> > present in the prior backup, but has been removed prior to the
> > current backup, then the reference count doesn't change - the
> > file is simply moved from the current backup to the prior backup.
>
> Are you counting the number of times the file appears in a backup
> (filled or unfilled), or are you counting the number of times the
> file appears in an attrib file? It seems to me that the second
> notion might be easier to deal with, since then there is a clear
> 1-1 correspondence between the count and the number of times the
> file is directly referenced in the pc tree.

> > Those "pool reference delta" files are stored in each PC's backup
> > directory, and also in the trash directory. There could be many
> > of these as backups are done and others are deleted. Their name
> > has the form "tpoolCntDelta.PID.NNN" as they are being written,
> > where PID is the process ID of the writing process, and NNN is a
> > number to ensure the file is unique. Once the file is closed, it
> > is renamed to "poolCntDelta.PID.NNN". Each PC directory and the
> > trash directory could have several or many of these files.
> >
> > The script bin/BackupPC_refCountUpdate reads all the poolCntDelta*
> > files in the PC and trash directories, and updates the poolCnt
> > files below CPOOL_DIR and POOL_DIR. If it encounters any errors,
> > it does its best to restore all the files to their original form.

That looks clean and reasonably simple! :-) And
BackupPC_refCountUpdate may be run at any time and in parallel with
backups (but not with BackupPC_nightly) - well done!

Oh wait... unfortunately not. There's a corner case: consider a file
which is only present in a backup that has just expired, so its
reference count would drop to zero during the update after
BackupPC_trashClean has finished. But a concurrent backup discovers
the same file somewhere else and reuses it. Therefore, you need to
ensure that *all* reference deltas are applied before running
BackupPC_nightly, which may delete the file. Okay, refcount updating
may be performed in parallel, but pool file deletion may not.

> Is this memory-intensive? I.e. does efficiency require large
> poolCntDelta files and potentially multiple poolCnt files to be
> stored in memory, or are they sorted in a way that allows the files
> to be processed chunk-by-chunk?

I wondered how this refcount updating could be implemented
efficiently... After all, the hashes distribute access uniformly
across all refcount files. One probably wants to avoid loading
everything into memory, as well as reading/updating/writing each file
multiple times. Maybe it could be done in two phases (see the rough
sketch in the PS below):

1. Scan through all pending updates and sort them by destination
   refcount file: use one temporary file per poolCnt file, into which
   the updates for that file are collected by simply appending them.

2. Load each poolCnt file that has pending updates, apply the
   collected updates, and remove the temporary file.

> I ask because I have been able to run BackupPC on ARM-based NASes
> with as little as 64 MB of RAM, and I am wondering whether 4.x
> offers speedups in terms of disk accesses at the expense of more
> in-RAM storage.

If the number of poolCnt files were configurable (at install time),
the amount of required RAM could easily be limited.

After all, everything looks very promising! :-)

Tino.

-- 
"What we nourish flourishes." - "Was wir nähren erblüht."

www.tisc.de
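PS: Here's the rough sketch of the two-phase merge I mentioned above.
Again, just an illustration, not BackupPC code - file and subroutine
names are made up, and I'm assuming the poolCntDelta files can be
read back into a digest => delta hash. Phase 1 only ever holds a line
buffer in RAM, and phase 2 holds a single poolCnt hash at a time, so
peak memory stays bounded by the largest poolCnt file - which also
matters for the 64 MB NAS case:

  use strict;
  use warnings;
  use Storable qw(retrieve store);

  my $CPoolDir = "/var/lib/backuppc/cpool";   # assumed location

  # Phase 1: bucket all pending deltas by destination poolCnt file,
  # appending to one temporary text file per bucket (128 in total).
  sub bucketDeltas {
      my (@deltaFiles) = @_;                  # poolCntDelta.* files
      my %fh;
      for my $deltaFile (@deltaFiles) {
          my $deltas = retrieve($deltaFile);  # { digest => +/-n }
          while ( my ($digest, $d) = each %$deltas ) {
              my $bucket = sprintf("%02x", ord($digest) & 0xfe);
              if ( !$fh{$bucket} ) {
                  open($fh{$bucket}, ">>",
                       "$CPoolDir/$bucket/poolCnt.pending")
                      or die "open $bucket: $!";
              }
              print { $fh{$bucket} } unpack("H*", $digest), " $d\n";
          }
      }
      close($_) for values %fh;
  }

  # Phase 2: each poolCnt file is read and written exactly once.
  sub mergeBucket {
      my ($bucket) = @_;                      # e.g. "4e"
      my $pending = "$CPoolDir/$bucket/poolCnt.pending";
      return unless -f $pending;
      my $cntFile = "$CPoolDir/$bucket/poolCnt";
      my $cnt = -f $cntFile ? retrieve($cntFile) : {};
      open(my $in, "<", $pending) or die "open $pending: $!";
      while ( my $line = <$in> ) {
          my ($hexDigest, $d) = split(" ", $line);
          $cnt->{pack("H*", $hexDigest)} += $d;
      }
      close($in);
      store($cnt, $cntFile);
      unlink($pending);
  }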