Hi Kern,

Well, your indication that no work is being done on item 1 at least implies that my research and work so far are not being duplicated elsewhere ... !

I did some work on item 2, but didn't get very far before deciding that, for the kind of data de-dup that I was implementing, work on item 3 would be a more readily implementable solution. I note, however, that an item 2 implementation would substantially reduce the bandwidth required to do backup across the network, which would help on-the-move laptop backup, for instance.

Apart from ZFS, the other free software de-duplication file-systems seem not to be production ready ... ZFS itself is not yet a default install for Linux systems [although available via ZFS-Fuse], and the FreeBSD version does not yet include block level de-dup: FreeBSD is at ZFS version 14, whereas deduplication was introduced in ZFS version 21.

For item 3, are there any notes / documents / wiki pages available with options considered and decisions made so far ?

Regards,

Howard

-----Original Message-----
From: Kern Sibbald <k...@sibbald.com>
To: bacula-devel@lists.sourceforge.net, howard.thom...@dial.pipex.com
Subject: Re: [Bacula-devel] De-duplication friendly Volume Format change
Date: Sat, 1 Jan 2011 18:25:42 +0100

Hello,

The first thing that one must do is specify what problem of deduplication one is trying to resolve:

1. Deduplication by the Bacula Storage daemon
2. Deduplication in the Bacula Client (File daemon)
3. Deduplication by the underlying filesystem where the SD writes data (e.g. ZFS).

Currently we are working on item 2, and have a design for a new (additional) Volume format for item 3. Exactly when these features will be available is completely open at this point.

Item 1 is probably something that will never be needed, because there are more and more very good filesystems that already do the job, especially if a new (additional) Volume format were to be implemented.

I've noticed that a few months after we discussed various features, the same thing was implemented by Zmanda, so I am a bit reluctant to give any details.
However, if there are programmers who want to do development, we would be happy to discuss off list.

Please keep in mind that we sometimes receive patches that programmers have made without discussing them with us, and often such patches are not appropriate for Bacula for lots of reasons: limited to a particular OS, doesn't respect coding standards, is not scalable, doesn't fit the Bacula way of doing things, doesn't use Bacula "infrastructure" (mostly libbac.so), ...

So before you start writing code, please discuss it with us first ...

Best regards,

Kern

On Saturday 01 January 2011 15:43:16 Howard Thomson wrote:
> Hi,
>
> I have been thinking about, and working on aspects of de-duplication for
> the Bacula Storage Daemon, following my talk at the Bacula Developer's
> Conference in September.
>
> Both strands of work involve making changes to the Volume Format:
>
> Firstly, the current strictly serialized Volume without alignment
> considerations, does not store data blocks [4kb aligned on 4kb file
> boundary] aligned similarly on disk Volumes, other than incidentally.
>
> See the blog at:
> [http://blog.myunix.dk/2010/12/15/large-scale-disk-to-disk-backups-using-bacula-part-vi/]
>
> An improvement, enabling the underlying file-system to do disk block
> de-duplication [as with ZFS] would be to split the process of packing
> stream records to the Volume between data and non-data streams:
>
> Block:
>     Block Header
>     Record Header
>     <Packed non-data streams>
>     <Record Header for aligned data, ends on 4kb alignment>
>     <Aligned 4kb data block(s)>
>
>     repeat ...
>
> Details needed for file tails and whole files <4kb
>
> Would a change to "BB03" be an appropriate designation for the Volume
> labelling to indicate the processing required ?
>
> Secondly, I am working on a, disk only, volume format where the data
> streams are stored independently of the volume, and the volume only
> contains the sequence of [SHA1/SHA256] hashes [+ size/offset] that
> regenerate the file content.
>
> The concept, although not implementation, is from the 'bup' package on
> Sourceforge, and uses a cyclic CRC to generate span selections of the
> data to hash and store, averaging 8kb in size for the initial
> implementation.
>
> Similar issues arise in how to specify the volume format to the SD.
>
> Suggestions ? New stream IDs ? Label changes ?
>
> Thirdly, is anyone else working along similar lines ?
>
> Regards, and Happy New Year,
>
> Howard

-- 
Howard Thomson <howard.thom...@dial.pipex.com>

_______________________________________________
Bacula-devel mailing list
Bacula-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-devel
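
For anyone following the second strand in the quoted message (content-defined spans, each stored under its hash, with the volume holding only the hash sequence), a minimal Python sketch of the idea is given here. It is not bup's or Bacula's code. The rsync-style rolling sum, the 64-byte window, the 13-bit boundary mask (picked so spans average roughly 8kb on random data), the in-memory dict standing in for the span store, and the names split_spans, backup and restore are all assumptions made for illustration.

```python
import hashlib

WINDOW = 64            # rolling window in bytes (assumed value)
MASK = (1 << 13) - 1   # 13 bits: a boundary about every 8192 bytes on random data


def split_spans(data):
    """Split data into spans whose end points depend only on the last WINDOW bytes."""
    spans = []
    s1 = s2 = 0          # rsync-style rolling sums over the window
    start = 0
    for i, byte in enumerate(data):
        old = data[i - WINDOW] if i >= WINDOW else 0   # byte leaving the window
        s1 = (s1 + byte - old) & 0xFFFF
        s2 = (s2 + s1 - WINDOW * old) & 0xFFFF
        if (s2 & MASK) == MASK:                        # content-defined boundary
            spans.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        spans.append(data[start:])                     # file tail
    return spans


def backup(data, store):
    """Put unseen spans into store; return the 'volume' as a list of (sha256, size)."""
    volume = []
    for span in split_spans(data):
        key = hashlib.sha256(span).hexdigest()
        store.setdefault(key, span)                    # duplicates are stored once
        volume.append((key, len(span)))
    return volume


def restore(volume, store):
    """Regenerate the file content from the hash sequence."""
    return b"".join(store[key] for key, _size in volume)


if __name__ == "__main__":
    store = {}
    content = bytes(range(256)) * 400
    volume = backup(content, store)
    assert restore(volume, store) == content
    print(len(volume), "spans in volume,", len(store), "unique spans stored")
```

Because a boundary depends only on the preceding 64 bytes, inserting or deleting data near the start of a file changes the first span or so and leaves the later spans, and therefore their hashes, unchanged. The sketch has no minimum or maximum span size, which a real implementation would need.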