On Friday 02 July 2010 14:56:35 Howard Thomson wrote:
> Hi Kern,
>
> On Friday 02 July 2010, Kern Sibbald wrote:
> > On Thursday 01 July 2010 21:46:50 Howard Thomson wrote:
> > > Hi Kern,
> > >
> > > On Thursday 01 July 2010, Kern Sibbald wrote:
> > > > Hello Howard,
> > > >
> > > > What does "chunked" backup mean exactly? I am not sure what the
> > > > high-level concept is here. Bacula can already back up
> > > > multi-gigabyte virtual disks, so obviously you are thinking about
> > > > something different.
> > >
> > > The concept that I am calling 'chunked backup' is sub-file
> > > incremental backup.
> > >
> > > Currently, for a 10GB VirtualBox virtual disk, a Full backup will
> > > back up the whole file.
> > >
> > > Subsequent incremental backups, where perhaps only 1MB of the
> > > virtual disk has changed, will back up the entire [10GB] file,
> > > because it has changed.
> > >
> > > Bacula currently records a hash value for the entire file, whereas I
> > > intend, in addition and for appropriately large files, to record a
> > > hash value for sub-file chunks, to be able to selectively not back
> > > up those chunks when doing an incremental / differential backup.
> >
> > OK, now I understand. This is a feature that we are working on -- it
> > is actually a form of deduplication. Before implementing it, there are
> > a number of things that need to be decided and some important changes
> > in Bacula that need to be made.
> >
> > 1. By the way, I call these "deltas"; that is, each one is a change to
> > the originally backed up image that must be applied. However, a delta
> > differs from an Incremental in two ways: 1. only part of the file is
> > saved; 2. *all* the deltas must be restored (not just the most recent,
> > as happens for incremental backups).
> >
> > 2. From the above, you can see that we need some way of marking these
> > as deltas rather than incrementals.
> > Perhaps it could simply be called a "delta" backup level rather than
> > Incremental.
> >
> > 3. We need to decide how the "deltas" are going to be generated --
> > there needs to be something to figure out what has changed, which
> > means, in general, you need access to the previous backups or some
> > form of hashing done by deduplication code.
> >
> > 4. Determine how the deltas are going to be stored -- actually, IMO,
> > that is trivial; it just needs a very small amount of code that looks
> > much like the sparse file handling code -- we may even be able to use
> > the same code.
> >
> > > I want to use Bacula to do full + incremental backups of my own
> > > system, to disk, without separating out virtual disks into separate
> > > backups with different recycle criteria, for space constraint
> > > reasons.
> > >
> > > Current [admittedly] simple-minded incremental backups of my file
> > > tree are much larger than they need to be ...
> >
> > Yes, much larger. We have some Bacula Systems scripts that help with
> > this for VirtualBox, but they are not integrated with Bacula as deltas
> > would be.
> >
> > This whole subject is non-trivial.
>
> It is certainly non-trivial ...
>
> Delta backup, to use your terminology, requires:
>
> 1/ Retrieve file-offset / hash-code pairs for the file being backed up
That is pretty straightforward. One just needs to do something similar to
what we do for Accurate backup and Base jobs, where information on prior
backups is sent to the FD. Of course, currently, we don't keep the file
offset as such for data that is backed up. In addition, I believe that
you need one more item -- the length of the delta. This would allow us
to easily deal with different filesystems or different filesystem block
sizes.

> 2/ Generate hash-code for each file-offset otherwise selected to backup

That is also straightforward. It could be passed to the SD as a special
stream, much like Unix attributes, that would then be passed on to the
Director for insertion in the database.

> 3/ Lookup file-offset in retrieved list and proceed with backup if
>    either not found [sparse file chunk not backed up] or found but
>    different

OK, but again, I think we need a delta length. We might want to vary the
length of the delta found according to file systems, and such ...

> 4/ Store all newly generated file-offset / hash-code pairs to the
>    database.

That is also straightforward. We would just implement a new stream coming
to the Director from the SD -- much like Unix attributes. It would be
just a different kind of database update.

> Restore of a delta backed-up file requires:
>
> 5/ Retrieve jobid (?) / file-offset pairs from database
>
> 6/ For each backup-stream read, selectively restore deltas as needed.
>    Restoring all deltas, in the right order, would work but be
>    bandwidth inefficient.
>
> In looking at all the relevant code, I am finding that the interaction
> with the database, directly and indirectly, is the least obvious
> structure to extend and change ...

Well, the most complicated and sensitive part is knowing in what table
one puts the information and designing the database records for that.
Then one has to modify the database and write the new routines to put
the new data into it. It isn't really hard but requires careful checking.
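The steps 1/ through 4/ above, and the restore steps 5/ and 6/, can be
sketched in a few lines of Python. This is a toy model, not Bacula code:
the fixed 64KB chunk size and the SHA-256 hash are my assumptions for
illustration, and the delta length I argued for above is carried as the
length of each stored chunk.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # assumed fixed chunk size; real code might vary it per filesystem


def chunk_hashes(data: bytes) -> dict:
    """Map each chunk's file offset to (length, hash) -- the pairs of step 1/."""
    table = {}
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        table[off] = (len(chunk), hashlib.sha256(chunk).hexdigest())
    return table


def compute_deltas(data: bytes, prev_hashes: dict) -> list:
    """Steps 2/ and 3/: hash the current chunks, look each offset up in the
    prior backup's table, and keep only chunks that are new or changed."""
    deltas = []
    for off, (length, digest) in chunk_hashes(data).items():
        if prev_hashes.get(off, (None, None))[1] != digest:
            deltas.append((off, data[off:off + length]))
    return deltas


def apply_deltas(base: bytes, delta_sets: list) -> bytes:
    """Restore: replay every delta set, oldest first, on top of the Full
    backup image -- *all* deltas must be applied, not just the newest."""
    image = bytearray(base)
    for deltas in delta_sets:
        for off, chunk in deltas:
            end = off + len(chunk)
            if end > len(image):
                image.extend(b"\0" * (end - len(image)))
            image[off:end] = chunk
    return bytes(image)
```

For a 3-chunk file where only the middle chunk changes, `compute_deltas`
emits a single (offset, data) pair, and replaying it over the Full image
reproduces the modified file.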
I recently added RestoreObjects in the database for Bacula Enterprise,
and if it isn't already in Branch-5.1 (the main Community development
branch), it will be there sometime in July as we start finalizing the
5.0.3 release, because we will carefully check which items in Branch-5.1
need to be backported to Branch-5.0 for the 5.0.3 release.

One *big* question is exactly how to store this information. Bacula
currently has only one means of storing multiple records of information
about a particular File, and that is the JobMedia records, which
effectively serve as the index to where the file data is on the given
Volume. I think we will need something similar to the JobMedia record to
store the hash, the offset, and the size. Compared to the current Bacula
tables, this one could potentially hold an enormous number of records.
In typical deduplication software, from what I have read, such tables
represent about 30% of the size of all the data backed up. Of course, I
don't expect to be doing deltas on every file on the filesystem, but it
certainly would be useful for VM images and log files.

> The comment on sparse file handling is, of course, correct, and I am
> treating delta file backup as a special case of sparse file backup.
>
> It seems to be the responsibility of the SD to send relevant updates to
> the Director, currently at the end of each file. However, the SD has no
> knowledge of which file-offsets of a sparse file it has processed on
> behalf of an FD, so I am unclear at present as to how, and when, the
> step 4/ updates to the database will occur.

As I mentioned above, I suspect that would be best handled by a new
stream that is sent to the SD, which will then know how to send it on to
the Director. Obviously, as with Unix attributes, the SD will have to
have some knowledge of this stream.

> When you say that this is being worked on, is it worth me continuing
> with current work-in-progress?
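To make the "JobMedia-like" table concrete, here is a small sqlite3
sketch of what a hash/offset/size record per delta chunk could look
like. The table name FileDelta and all column names are hypothetical,
invented for illustration -- this is not Bacula's actual catalog schema,
and Bacula does not use SQLite in this way.

```python
import sqlite3

# Hypothetical catalog table, loosely modelled on JobMedia: one row per
# delta chunk, keyed by the job and file it belongs to. Names are
# assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE FileDelta (
        FileDeltaId INTEGER PRIMARY KEY,
        JobId       INTEGER NOT NULL,
        FileIndex   INTEGER NOT NULL,
        FileOffset  INTEGER NOT NULL,  -- byte offset of the chunk in the file
        DeltaLength INTEGER NOT NULL,  -- length of the delta (may vary per filesystem)
        ChunkHash   TEXT    NOT NULL   -- hex digest used for change detection
    )
""")

# A Full backup (JobId 1) records every chunk of the file; a later delta
# job (JobId 2) records only the chunk that changed.
rows = [
    (1, 1, 0,     65536, "aaaa"),
    (1, 1, 65536, 65536, "bbbb"),
    (2, 1, 65536, 65536, "cccc"),  # delta job: only the changed chunk
]
conn.executemany(
    "INSERT INTO FileDelta (JobId, FileIndex, FileOffset, DeltaLength, ChunkHash)"
    " VALUES (?, ?, ?, ?, ?)", rows)


def latest_chunks(file_index):
    """For restore, pick the newest record per offset (the most recent
    JobId wins), ordered by offset. Uses SQLite's documented behaviour
    that bare columns in a MAX() aggregate come from the winning row."""
    return conn.execute("""
        SELECT FileOffset, ChunkHash, MAX(JobId)
          FROM FileDelta
         WHERE FileIndex = ?
         GROUP BY FileOffset
         ORDER BY FileOffset
    """, (file_index,)).fetchall()
```

The restore-side query then yields one row per offset, with the second
chunk coming from the delta job rather than the Full backup.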
Probably yes, if it interests you, but we need to get the design nailed
down and agreed on before doing any serious coding.

> I haven't altered many files yet in my git repo; I've spent more time
> reading code than writing it so far ...!

I should have been more precise and said that we are in the design phase
of this project, but have not yet started programming -- we must finish
the Enterprise 4.0 and Community 5.0.3 releases first ...

What do you think?

Kern

_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
