Hi Kern,

Well, your reply suggests that no one is working on item 1, which at
least means that my research and work so far are not being duplicated
elsewhere ... !

I did some work on item 2, but didn't get very far before deciding that,
for the kind of data de-dup that I was implementing, work on item 3
would be a more readily implementable solution. I note, however, that an
item 2 implementation would substantially reduce the bandwidth required
to do backup across the network, which would benefit on-the-move laptop
backup, for instance.

Apart from ZFS, the other free software de-duplication file-systems seem
not to be production ready ... ZFS itself is not yet a default install
for Linux systems [although available via ZFS-Fuse], and the FreeBSD
version does not yet include block level de-dup: FreeBSD is at ZFS
version 14, whereas deduplication was introduced in ZFS version 21.

For item 3, are there any notes / documents / wiki pages available with
options considered and decisions made so far ?

Regards,

Howard



-----Original Message-----
From: Kern Sibbald <k...@sibbald.com>
To: bacula-devel@lists.sourceforge.net, howard.thom...@dial.pipex.com
Subject: Re: [Bacula-devel] De-duplication friendly Volume Format change
Date: Sat, 1 Jan 2011 18:25:42 +0100

Hello,

The first thing that one must do is specify what problem of deduplication one 
is trying to resolve:

1. Deduplication by the Bacula Storage daemon

2. Deduplication in the Bacula Client (File daemon)

3. Deduplication by the underlying filesystem where the SD writes data (e.g. 
ZFS).

Currently we are working on item 2, and have a design for a new (additional) 
Volume format for item 3.  Exactly when these features will be available is 
completely open at this point.

Item 1 is probably something that will never be needed, because more and
more very good filesystems already do the job, especially if a new
(additional) Volume format were to be implemented.

I've noticed that a few months after we discussed various features, the same 
thing was implemented by Zmanda, so I am a bit reluctant to give any details.
However, if there are programmers who want to do development, we would be happy
to discuss off list.  Please keep in mind that we sometimes receive patches
that programmers have made without discussing them with us, and often such
patches are not appropriate for Bacula for lots of reasons: limited to a
particular OS, doesn't respect coding standards, is not scalable, doesn't fit
the Bacula way of doing things, doesn't use Bacula "infrastructure" (mostly
libbac.so), ...

So before you start writing code, please discuss it with us first ...

Best regards,

Kern


On Saturday 01 January 2011 15:43:16 Howard Thomson wrote:
> Hi,
>
> I have been thinking about, and working on aspects of de-duplication for
> the Bacula Storage Daemon, following my talk at the Bacula Developer's
> Conference in September.
>
> Both strands of work involve making changes to the Volume Format:
>
> Firstly, the current strictly serialized Volume format takes no account
> of alignment, so data blocks [4kb blocks on 4kb file boundaries] end up
> similarly aligned on disk Volumes only incidentally.
>
> See the blog at:
> [http://blog.myunix.dk/2010/12/15/large-scale-disk-to-disk-backups-using-bacula-part-vi/]
>
> An improvement, enabling the underlying file-system to do disk block
> de-duplication [as with ZFS] would be to split the process of packing
> stream records to the Volume between data and non-data streams:
>
> Block:
>       Block Header
>       Record Header
>       <Packed non-data streams>
>       <Record Header for aligned data, ends on 4kb alignment>
>       <Aligned 4kb data block(s)>
>
>       repeat ...
>
> Details needed for file tails and whole files <4kb
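The alignment arithmetic behind the layout quoted above can be sketched as follows. This is only an illustration, not Bacula code: the function names, the 12-byte record header, and the idea of inserting padding before the record header are all assumptions made for the example; the real header sizes and Volume format are not represented here.

```python
ALIGN = 4096  # 4kb alignment, as in the quoted proposal


def padding_to_align(offset: int, header_len: int, align: int = ALIGN) -> int:
    """Padding bytes needed at `offset` so that a record header of
    `header_len` bytes ends exactly on an `align` boundary."""
    return (-(offset + header_len)) % align


def layout_block(offset: int, non_data_len: int, header_len: int,
                 data_len: int, align: int = ALIGN):
    """Place packed non-data streams, then a padded record header, then data.

    Returns (pad, data_start, next_offset). All sizes are hypothetical.
    """
    pos = offset + non_data_len               # end of packed non-data streams
    pad = padding_to_align(pos, header_len, align)
    data_start = pos + pad + header_len       # header ends here, data begins
    return pad, data_start, data_start + data_len


# 300 bytes of non-data streams starting at offset 100, a 12-byte record
# header (made-up size), then two aligned 4kb data blocks.
pad, data_start, next_offset = layout_block(100, 300, 12, 2 * ALIGN)
```

With these made-up sizes the data lands on a 4kb boundary, which is what would allow a block-level de-duplicating filesystem to see identical file blocks as identical disk blocks. File tails and files under 4kb are left open, as in the original post.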
>
> Would a change to "BB03" be an appropriate designation for the Volume
> labelling to indicate the processing required ?
>
> Secondly, I am working on a disk-only volume format where the data
> streams are stored independently of the volume, and the volume only
> contains the sequence of [SHA1/SHA256] hashes [+ size/offset] needed
> to regenerate the file content.
>
> The concept, though not the implementation, is from the 'bup' package
> on Sourceforge, and uses a cyclic CRC to generate span selections of
> the data to hash and store, averaging 8kb in size for the initial
> implementation.
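The span-selection idea quoted above can be sketched roughly as below. This is an assumption-laden illustration, not the implementation described in the post and not bup's: it swaps in a simple polynomial rolling hash in place of the cyclic CRC, and the window size, constants, function names, and the in-memory dict standing in for the external span store are all hypothetical.

```python
import hashlib

WINDOW = 64                           # rolling-window width in bytes (illustrative)
MASK = (1 << 13) - 1                  # 13 bits: a boundary about every 8kb on average
BASE = 257
MOD = 1 << 32
OUT_FACTOR = pow(BASE, WINDOW, MOD)   # weight of the byte leaving the window


def split_spans(data: bytes):
    """Yield spans whose boundaries depend on content, not on position."""
    h = 0
    start = 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            # Drop the contribution of the byte that just left the window.
            h = (h - data[i - WINDOW] * OUT_FACTOR) % MOD
        if (h & MASK) == MASK:        # low 13 bits all set: cut here
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]            # trailing span


def store_stream(data: bytes, store: dict) -> list:
    """Store each span once under its SHA-256; return the (hash, size) list
    that a volume would hold instead of the data itself."""
    recipe = []
    for span in split_spans(data):
        digest = hashlib.sha256(span).hexdigest()
        store.setdefault(digest, span)
        recipe.append((digest, len(span)))
    return recipe


def restore_stream(recipe: list, store: dict) -> bytes:
    """Regenerate the file content from the sequence of hashes."""
    return b"".join(store[digest] for digest, _ in recipe)
```

Because each cut depends only on the bytes in the rolling window, an insertion near the start of a file should shift only the nearby span boundaries, so later spans can still hash to values already in the store.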
>
> Similar issues arise in how to specify the volume format to the SD.
>
> Suggestions ? New stream IDs ? Label changes ?
>
> Thirdly, is anyone else working along similar lines ?
>
> Regards, and Happy New Year,
>
> Howard



-- 
Howard Thomson <howard.thom...@dial.pipex.com>


