Hello Howard,

On Thursday 01 July 2010 18:42:37 Howard Thomson wrote:
> Hi Kern & Bacula-developers,
>
> I have been working on changes to Bacula to enable chunked backup for large
> files, such as multi-gigabyte virtual disks [which I have], and possibly
> database files etc.

What does "chunked" backup mean exactly?  I am not sure what the high-level 
concept is here.  Bacula can already back up multi-gigabyte virtual disks, so 
obviously you are thinking of something different.

>
> I need to establish how and when the per-chunk hash values are retrieved
> from the database and stored/updated to the database.

It sounds a bit like you are trying to implement some sort of deduplication 
code, but I am not sure.

>
> I am starting with backup changes, for obvious reasons, and note that the
> data stream from FD -> SD is a single contiguous stream, albeit transferred
> in record sized pieces.

Actually, it is not a single contiguous stream -- it is lots of packets.  
Those packets contain multiple streams of data (one at a time).  
The protocol is adaptable to pretty much anything.
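To make that concrete, here is a sketch of the general idea only -- this is NOT Bacula's actual wire format, and the field and stream names are illustrative -- each packet carries a small header saying which logical stream its payload belongs to, so many streams can share one connection:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only -- not Bacula's real record layout.  The point is
 * that each packet is self-describing: a stream id tells the receiver
 * how to interpret the payload, so file data, attributes, digests,
 * etc. can be interleaved on a single connection. */
typedef struct {
    int32_t  file_index;   /* which file this record belongs to */
    int32_t  stream_id;    /* what kind of payload follows */
    uint32_t length;       /* payload bytes that follow the header */
} record_header;

/* A receiver dispatches on the stream id rather than assuming one
 * contiguous byte stream.  Stream numbers here are made up. */
static const char *describe_stream(int32_t stream_id)
{
    switch (stream_id) {
    case 1:  return "file data";
    case 2:  return "file attributes";
    case 3:  return "digest";
    default: return "unknown";
    }
}
```

A per-chunk hash could in principle ride along as just one more tagged stream in such a scheme.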

>
> I was envisaging alternate data-chunk / chunk-hash transfers, but that does
> not fit as easily into the existing code as I had hoped
> [src/stored/append.c and src/filed/backup.c].

Transferring blocks of data (or chunks) should really be no problem.  Bacula 
currently does essentially that, but it is designed to use very little memory 
so it does not accumulate the whole contents of a file before writing it 
out -- it simply writes out what it has as it comes in.
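Per-chunk hashing could be done in that same streaming fashion.  The following is only a sketch of the idea, not existing Bacula code -- the 64 KiB chunk size and the FNV-1a hash are arbitrary choices for illustration: hash state is carried across records, and a digest is emitted whenever a chunk boundary is crossed, so memory use stays constant no matter how large the file is.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 65536   /* hypothetical fixed chunk size */

/* Running state for one file: the FNV-1a hash of the chunk currently
 * in progress, plus how far into that chunk we are. */
typedef struct {
    uint64_t hash;     /* FNV-1a state for the current chunk */
    size_t   filled;   /* bytes of the current chunk seen so far */
} chunker;

static const uint64_t FNV_OFFSET = 1469598103934665603ULL;
static const uint64_t FNV_PRIME  = 1099511628211ULL;

static void chunker_init(chunker *c)
{
    c->hash   = FNV_OFFSET;
    c->filled = 0;
}

/* Feed one record's worth of data.  Hashes of chunks completed by
 * this record are appended to out[]; returns how many were emitted.
 * Nothing is buffered -- bytes are hashed as they stream through. */
static size_t chunker_feed(chunker *c, const uint8_t *buf, size_t len,
                           uint64_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        c->hash = (c->hash ^ buf[i]) * FNV_PRIME;
        if (++c->filled == CHUNK_SIZE) {
            out[n++] = c->hash;    /* chunk boundary: emit and reset */
            c->hash   = FNV_OFFSET;
            c->filled = 0;
        }
    }
    return n;
}

/* At end of file, flush a trailing partial chunk, if any.
 * Returns 1 if a hash was written to *out, 0 otherwise. */
static int chunker_finish(chunker *c, uint64_t *out)
{
    if (c->filled == 0)
        return 0;
    *out = c->hash;
    chunker_init(c);
    return 1;
}
```

With something like this, the FD (or SD) could emit offset/hash pairs as a side stream while the data passes through, without ever holding a whole file in memory.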

>
> Does the per-chunk hash value info also need to go onto the storage media
> as to the database ?

Sorry, I cannot answer that until I understand what a chunk is and what the 
hash code is used for -- i.e. when it is needed and why.  If you are talking 
about deduplication, it is a very big project that will need a lot of 
careful design work before any implementation.

>
> If it does, then I could simply accumulate the file-offset/hash-value pairs
> and send them as a separate stream after the data, although that may be
> less than ideal in memory consumption terms.

There are some specific cases where Bacula accumulates things such as hash 
codes, but we try to avoid that wherever possible, because doing so 
immediately imposes limits on what Bacula can handle.
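To give a sense of scale -- these numbers are mine, not anything in Bacula -- accumulating one offset/hash pair per chunk grows linearly with file size.  For a hypothetical 100 GiB file split into 64 KiB chunks, at 16 bytes per pair (8-byte offset plus 8-byte hash) that is already about 25 MiB held in memory for a single file in flight:

```c
#include <stdint.h>

/* Back-of-the-envelope cost of accumulating offset/hash pairs.
 * All parameters are hypothetical; the point is the linear growth. */
static uint64_t pair_table_bytes(uint64_t file_bytes, uint64_t chunk_bytes,
                                 uint64_t bytes_per_pair)
{
    /* one pair per chunk, rounding the last partial chunk up */
    uint64_t chunks = (file_bytes + chunk_bytes - 1) / chunk_bytes;
    return chunks * bytes_per_pair;
}
```

That per-file, size-dependent growth is exactly the kind of limitation meant above, and why streaming the pairs out as they are produced is preferable to accumulating them.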

>
> For restore, the current code is configured such that the SD is unaware of
> the file-offset values for a sparse data stream, which means that the SD
> would be unable to be selective about the data which it sends to the FD,
> which is somewhat link-inefficient.

Yes, we have tried to make the SD know as little as possible about the format 
of the data.  Its job is to store data on disk or tape, then to restore it 
and send it to the FD.  It is the Director which tells the SD what data to 
retrieve.

>
> Any comments ?
>
> Will you [Kern] be at the Amsterdam meeting at all ?

No, I will not be attending.  Arno will be there though.

Best regards,

Kern

>
> Regards,
>
> Howard
>
> --
> "Only two things are infinite, the universe and human stupidity,
> and I'm not sure about the former." -- Albert Einstein



_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
