Ben Reser wrote:

> Speaking with Julian here at ApacheCon he mentioned that gzip has a
> rsyncable option.  Looking into this turns out that there is a patch
> applied to Debian's gzip that provides this option.  It resets the
> compression algorithm every 1000 bytes and thus makes blocks that can

Use of such a zip format would be ideal -- Subversion's binary-delta would then
calculate an excellent delta as long as each inserted chunk is smaller than
the delta window size (currently 100 KB, Stefan's proposal 1 MB).

I'm not sure about the details of how the restartable compression works, but it 
somehow selects points in the uncompressed data that don't depend on the 
absolute byte offset from the start of the file, and resets the compression at 
those points.

As I understand it, only the compressor needs the special logic, and the 
resulting compressed file is still in the same format and fully compatible with 
the standard decompression libraries.
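
To make the idea concrete, here is a rough sketch in Python. It is not the
Debian gzip patch; the rolling-sum trigger, the WINDOW and MODULUS values, and
the function names are all assumptions chosen for illustration. It picks reset
points from a rolling sum of the preceding bytes (so they depend on content,
not on absolute offset) and does a full flush of the zlib compressor at each
one. The output is an ordinary zlib stream.

```python
import zlib

WINDOW = 4096    # rolling-window size in bytes (illustrative assumption)
MODULUS = 4096   # reset when the window sum is a multiple of this (assumption)

def reset_points(data: bytes) -> list:
    """Return offsets at which the compressor should be reset.

    Each point is decided only by the WINDOW bytes preceding it, so the
    same content yields the same points wherever it sits in the file.
    """
    points = []
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]   # drop the byte leaving the window
        if i + 1 >= WINDOW and rolling % MODULUS == 0:
            points.append(i + 1)          # reset just after byte i
    return points

def restartable_compress(data: bytes) -> bytes:
    """Compress data, resetting the compressor state at each reset point."""
    comp = zlib.compressobj()
    out = []
    start = 0
    for point in reset_points(data):
        out.append(comp.compress(data[start:point]))
        # Z_FULL_FLUSH ends the current block and forgets the history,
        # so later output no longer depends on earlier input.
        out.append(comp.flush(zlib.Z_FULL_FLUSH))
        start = point
    out.append(comp.compress(data[start:]))
    out.append(comp.flush())
    return b"".join(out)
```

The result can be read back with plain zlib.decompress(), which matches the
point that only the compressor needs the special logic.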

But unfortunately, although patches for this "restartable" or "rsyncable" mode
of compression have been around for years, and it can have a very low overhead,
it doesn't yet seem to have been implemented in the common compression
libraries (such as zlib), and OpenOffice doesn't offer that mode.

Therefore this is not a practical solution at the moment.

> be saved between revisions of the file.  gzip uses the same DEFLATE
> algorithm that most zip files use, so the same idea could be applied
> to it.  If we want to deal with something like this in Subversion, I
> think we'd deal with it via some sort of plugin for specific file
> types that could convert to the more efficient to deltify encoding
> before committing.  Unfortunately, we don't have any sort of plugin
> type infrastructure for this today.

Yes, a client-side plug-in -- either to Subversion or to OpenOffice -- seems to 
me the best practical solution.

There exists a plug-in to OpenOffice, "OOoSVN", which, when you want to commit 
the current version of the doc that you are editing, uncompresses the doc file 
into a tree of files in its own private svn working copy (that it creates in 
your home directory) and commits that.  Similarly, to update your doc to an old 
version, or to retrieve two versions and diff them, it updates this hidden WC 
and then compresses the files in the WC into a ".odt" or whatever, and lets 
OpenOffice load or diff that file.
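
The unpack/repack part of that workflow can be sketched in Python as below.
This is not OOoSVN's actual code; the function names explode() and repack()
are hypothetical, and the svn commit/update steps in between are left out.
It assumes the document is an ordinary zip container, as .odt files are.

```python
import os
import zipfile

def explode(doc_path, tree_dir):
    """Unpack a zipped document (e.g. .odt) into a directory of plain files."""
    with zipfile.ZipFile(doc_path) as zf:
        zf.extractall(tree_dir)

def repack(tree_dir, doc_path):
    """Rebuild the zipped document from the directory tree."""
    with zipfile.ZipFile(doc_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # ODF expects "mimetype" to be the first entry, stored uncompressed.
        mimetype = os.path.join(tree_dir, "mimetype")
        if os.path.exists(mimetype):
            zf.write(mimetype, "mimetype", compress_type=zipfile.ZIP_STORED)
        for root, dirs, files in os.walk(tree_dir):
            if ".svn" in dirs:
                dirs.remove(".svn")       # skip working-copy metadata
            dirs.sort()                   # deterministic entry order
            for name in sorted(files):
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, tree_dir).replace(os.sep, "/")
                if arcname != "mimetype":
                    zf.write(full, arcname)
```

Committing the exploded tree rather than the zip means Subversion sees the
uncompressed XML, which is what lets it compute small deltas.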

I have tried "OOoSVN" and it works but it is very crude -- the user interface 
is poor and it is not flexible -- it only supports a local dedicated svn 
repository, for example.

> Even still there are things that can be done today.  I made a couple
> trivial Microsoft Office Word documents.  One with the characters
> "abc" in them and one with "abcdef" in it.  I saved the files in .docx
> and in the 2003 flat XML format.  The .docx file produced a delta of
> 3262 bytes, the .xml format produced a file with a delta of just 358
> bytes.
> 
> OpenOffice/LibreOffice support flat versions of their format (e.g.
> .fodt) that are not compressed and can also be more efficiently stored
> in Subversion.  LibreOffice even has a wiki about this:
> https://wiki.documentfoundation.org/Libreoffice_and_subversion

We should talk to the OpenOffice folks and see if we can convince them of the 
value of using a restartable compression by default, and find out how possible 
that is.  It would be great if that Wiki page could even say, "We'd like to use 
restartable compression for this reason but we need the compression library 
developers to make it available."

But for a practical solution until restartable compression becomes the norm (if 
it ever does), if you (Magnus) would like to help by designing some kind of 
solution, that would be great.  Please do keep discussing it here if you have 
any thoughts in this direction.  FWIW I think it's an important and interesting 
issue.

- Julian
