The problem is that there are currently three filters defined:
compression, encryption, and sparse file handling.  The current
implementations of both compression and sparse file handling require
block boundary preservation.  Even if zlib streaming could handle the
existing block-based data, sparse file handling would still break.

-----Original Message-----
From: Landon Fuller [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 02, 2006 11:06 AM
To: Robert Nelson
Cc: 'Michael Brennen'; [EMAIL PROTECTED];
bacula-users@lists.sourceforge.net
Subject: Re: [Bacula-users] Encryption/Compression Conflict in CVS


On Nov 2, 2006, at 08:30, Robert Nelson wrote:

> Landon,
>
> I've changed the code so that the encryption code prefixes the data 
> block with a block length prior to encryption.
>
> The decryption code accumulates data until a full data block is 
> decrypted before passing it along to the decompression code.
>
> The code now works for all four scenarios with encryption and
> compression:
> none, encryption, compression, and encryption + compression.
> Unfortunately, the code is no longer compatible with previously
> encrypted backups.
>
> I could add some more code to make the encryption-only case work as
> before.  However, since this is a new feature in 1.39 and there 
> shouldn't be many existing backups, I would prefer to invalidate the
> previous backups and keep the code simpler.
>
> Also, I think we should have a design rule that any data filters,
> such as encryption, compression, etc., must maintain the original
> buffer boundaries.
>
> This will allow us to define arbitrary, dynamically extensible filter 
> stacks in the future.
>
> What do you think?
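
The framing described above -- a length header prepended to each
plaintext block before encryption, so the decryptor can reassemble whole
blocks -- might look roughly like this (a minimal sketch with placeholder
names, not the actual patch):

    #include <stdint.h>
    #include <string.h>

    /* Prefix a plaintext block with a 4-byte big-endian length before
     * handing it to the cipher.  On the decryption side, decrypted
     * bytes are accumulated until the header and then all 'len' bytes
     * are available, at which point one original block can be passed
     * downstream intact. */
    static size_t frame_block(const uint8_t *data, uint32_t len,
                              uint8_t *out /* must hold len + 4 bytes */)
    {
        out[0] = (uint8_t)(len >> 24);
        out[1] = (uint8_t)(len >> 16);
        out[2] = (uint8_t)(len >> 8);
        out[3] = (uint8_t)(len);
        memcpy(out + 4, data, len);
        return (size_t)len + 4;
    }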

I was thinking about this on the way to work. My original assumption was
that Bacula used the zlib streaming API to maintain state during file
compression/decompression, but this is not the case. Reality is something
more like this:

Backup:
        - Set up the zlib stream context.
        - For each file block (not each file), compress the block via
          deflate(stream, Z_FINISH) and reinitialize the stream.
        - After all files (and blocks) are compressed, destroy the
          stream context.

Restore:
        - For each block, call uncompress(), which does not handle
          streaming (both sides are sketched below).
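
In code, that pattern is roughly the following (a sketch against the
zlib API, not the actual Bacula routines; assumes out_cap was sized with
deflateBound() and omits most error handling):

    #include <zlib.h>

    /* Backup side: each Bacula block becomes an independent deflate
     * stream.  Z_FINISH terminates the stream after this one block, so
     * the stream must be reset -- discarding the 32 KB LZ77 window and
     * the Huffman state -- before the next block. */
    static int compress_block(z_stream *zs, const Bytef *in, uInt in_len,
                              Bytef *out, uInt out_cap, uLong *out_len)
    {
        zs->next_in   = (Bytef *)in;
        zs->avail_in  = in_len;
        zs->next_out  = out;
        zs->avail_out = out_cap;

        if (deflate(zs, Z_FINISH) != Z_STREAM_END)
            return Z_STREAM_ERROR;
        *out_len = out_cap - zs->avail_out;
        return deflateReset(zs);
    }

    /* Restore side: uncompress() is zlib's one-shot convenience API; it
     * keeps no state between calls, so it cannot resume a stream that
     * spans multiple blocks. */
    static int decompress_block(const Bytef *in, uLong in_len,
                                Bytef *out, uLongf *out_len)
    {
        return uncompress(out, out_len, in, in_len);
    }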

This is unfortunate -- reinitializing the stream for each block
significantly degrades compression efficiency, because: 1) block
boundaries are dynamic and may fall arbitrarily; 2) the LZ77 algorithm
may match across block boundaries, referring back as far as 32 KB of
previous input data
(http://www.gzip.org/zlib/rfc-deflate.html#overview); 3) the Huffman
coding context comprises the entire block; and 4) there's no need to
limit zlib's block size to Bacula's block size.
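
By contrast, a streaming version would keep a single stream alive across
all of a file's blocks, roughly like this (sketch; same buffer-sizing
assumption as above):

    /* Matches may now span block boundaries because the 32 KB window
     * and Huffman state survive between calls.  Note that with
     * Z_NO_FLUSH, deflate may buffer input and emit little or nothing
     * for a given call, so output boundaries no longer line up with
     * input block boundaries. */
    static int stream_compress_block(z_stream *zs,
                                     const Bytef *in, uInt in_len,
                                     Bytef *out, uInt out_cap,
                                     uLong *out_len, int last_block)
    {
        zs->next_in   = (Bytef *)in;
        zs->avail_in  = in_len;
        zs->next_out  = out;
        zs->avail_out = out_cap;

        /* Z_FINISH only on the file's final block. */
        int rc = deflate(zs, last_block ? Z_FINISH : Z_NO_FLUSH);
        if (rc == Z_STREAM_ERROR || (last_block && rc != Z_STREAM_END))
            return Z_STREAM_ERROR;
        *out_len = out_cap - zs->avail_out;
        return last_block ? deflateReset(zs) : Z_OK;
    }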

The next question is this -- given that we *should* stream the data,
does it make sense to enforce downstream block boundaries in the
upstream filter? I'm siding in favor of requiring streaming support, and
thus letting each filter implementor worry about their own block
buffering, since they can far better encapsulate the necessary state and
implementation -- and most already do.
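
Concretely, a streaming filter interface might look something like the
following (purely illustrative -- not an existing Bacula structure):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    /* Each filter owns its own buffering and state; its output need
     * not preserve the caller's block boundaries. */
    typedef struct filter {
        void *state;
        /* Consume in_len input bytes and emit whatever output is
         * ready; returns bytes written to out, or -1 on error. */
        ssize_t (*process)(struct filter *f,
                           const uint8_t *in, size_t in_len,
                           uint8_t *out, size_t out_cap);
        /* Flush any buffered tail at end of stream. */
        ssize_t (*finish)(struct filter *f,
                          uint8_t *out, size_t out_cap);
    } filter_t;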

The one other thing I'm unsure of is whether the zlib streaming API
correctly handles streams written as described above -- each Bacula data
block as an independent 'stream'. If zlib DOES handle this, it should be
possible to modify the backup and restore implementations to use the
stream API correctly while maintaining backwards compatibility. This
would fix the encryption problem AND increase compression efficiency.
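
If it does, the restore side could walk the old data with something like
this (sketch; inflate() returns Z_STREAM_END at the end of each embedded
deflate stream, and inflateReset() prepares the same context for the
next one):

    /* Decompress data written as back-to-back independent deflate
     * streams (one per original block) using the streaming API. */
    static int inflate_legacy(z_stream *zs, Bytef *in, uInt in_len,
                              Bytef *out, uInt out_cap)
    {
        zs->next_in  = in;
        zs->avail_in = in_len;

        while (zs->avail_in > 0) {
            zs->next_out  = out;
            zs->avail_out = out_cap;

            int rc = inflate(zs, Z_NO_FLUSH);
            if (rc == Z_STREAM_END) {
                /* End of one embedded stream; reset for the next. */
                if (inflateReset(zs) != Z_OK)
                    return Z_DATA_ERROR;
            } else if (rc != Z_OK) {
                return rc;
            }
            /* ... hand (out_cap - zs->avail_out) bytes downstream ... */
        }
        return Z_OK;
    }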

With my extremely large database backups, I sure wouldn't mind increased
compression efficiency =)

Some documentation on the zlib API is available here (I had a little
difficulty googling this):

http://www.freestandards.org/spec/booksets/LSB-Core-generic/LSB-Core-generic/libzman.html

Cheers,
Landon


