Hi,

> -----Original Message-----
 
> The gist of the above is that if we choose a better-than-gz compression
> and combine the files both *before* and *after* compression, we'll have
> much more significant results than what we have now just with gz on
> individual files. This can be noticed using tar.bz2, for example, where
> the result is not unlike what we can achieve with the custom file format
> (although bz2 is probably too slow for our needs).

Maybe xz (lzma2) is the algorithm to look at. It usually offers a better 
trade-off between CPU usage and compression ratio, and decompression is nearly 
as fast as gz.
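
Just to illustrate the kind of trade-off I mean, here is a rough sketch using 
Python's gzip and lzma modules on made-up sample data (not a benchmark of real 
pristine contents):

    import gzip
    import lzma

    # Rough illustration only, on made-up repetitive sample data; real
    # pristine contents and real timing measurements would be needed to
    # actually judge the cpu/ratio trade-off.
    payload = b"some repetitive source-code-like content\n" * 2000

    gz = gzip.compress(payload)   # deflate/gz
    xz = lzma.compress(payload)   # xz container with LZMA2
    print(len(payload), len(gz), len(xz))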

> I also like the fact that the pristine files are opaque and don't
> encourage the user to mess with them. Markus raised this point as
> "debuggability". I don't see "debuggability" as a user requirement (it is
> justifiably an SVN dev/maintainer requirement) and I don't find reason to
> add it as one. On the contrary, there are many reasons to suspect the user
> is doing something gravely wrong when they mess with the pristine files.

So maybe a developer tool for (un)packing pristine archives should be created.

> Another point raised by Markus is to store "common pristine" files and
> reuse them to reduce network traffic.

This point is independent of the compression of the pristine store. Both a 
working-copy-local and a common pristine store can benefit from the 
compression.

> Sqlite may be used as Branko has suggested. I'm not opposed to this. It
> has its shortcomings (not exploiting inter-file similarities which point
> #3 makes, for one) but it can be considered as a compromise between
> individual gz files and the custom pack file. The basic idea would be to
> store "small" files (after compression) in wc.db and have a "link" to
> compressed files on disk for "large" files.

Maybe a distinct pristine.db would be better than putting them into wc.db, but 
I'm not sure about that.

> My main concern is that
> frequent updates to small files will leave the sqlite file with heavy
> external fragmentation (holes within the file unused but consuming disk-
> space).

Usually, SQLite re-uses free space within the same database rather efficiently.

> The solution is to "vacuum" wc.db, but that depends on its size,
> will lock it when vacuuming and other factors, so we can't do it as
> routine.

"svn cleanup" would be a good opportunity.


I just had another idea: we could store the metadata in the SQLite database.

In wc.db, in the pristine table, store 4 columns[1]: filename, offset, length, 
algorithm.

"filename" denotes the container file name. Payload files are first 
concatenated, then compressed, then put into the container. Offset and length 
are byte-offsets in the decompressed bytestream. "Algorithm" denotes the 
compression algorithm, with one value reserved for uncompressed storage. If a 
container grows beyond a specific limit, a new file is created.
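
A minimal sketch of how that table could look (purely hypothetical names and 
values, using Python's sqlite3 module just for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")  # stand-in for wc.db

    # Hypothetical sketch of the pristine table described above; the real
    # table needs more columns (checksum, ref_count, ...), see [1].
    con.execute("""
        CREATE TABLE pristine (
            checksum    TEXT PRIMARY KEY,
            filename    TEXT NOT NULL,     -- container file name
            "offset"    INTEGER NOT NULL,  -- position in the decompressed stream
            "length"    INTEGER NOT NULL,  -- size in the decompressed stream
            algorithm   INTEGER NOT NULL   -- e.g. 0 = uncompressed, 1 = gz, 2 = xz
        )
    """)
    con.execute("INSERT INTO pristine VALUES (?, ?, ?, ?, ?)",
                ("some-sha1", "container-0001.gz", 0, 1234, 1))

    # Reading a pristine back: look up where it lives, decompress (or read)
    # the container, then take "length" bytes starting at "offset".
    row = con.execute('SELECT filename, "offset", "length", algorithm '
                      'FROM pristine WHERE checksum = ?', ("some-sha1",)).fetchone()
    print(row)   # ('container-0001.gz', 0, 1234, 1)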

The main advantage of storing the metadata in SQLite is that we do not need to 
invent any new file format.

Some other positive aspects (some of them are clearly also possible using your 
original proposal):

- This allows us to apply concatenation and compression orthogonally, on a 
container-by-container basis:
  - We can handle short, non-compressible files just by concatenating them 
into an uncompressed container.
  - We can handle large, well-compressible files the "each file has its own 
container" way.

- By reserving a special length value (like -1 or SQL NULL) for "look at the 
file on disk", we can quickly upgrade existing working copies without touching 
the pristine files at all.
  - This way, we could make the WC upgrade implicit again, as we only add three 
columns with well-defined default values (offset=0, length=-1, 
algorithm=uncompressed) to the table, as sketched below.
  - "svn cleanup" could grow an option to reorganize / optimize the pristine 
storage.
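
A sketch of such an implicit upgrade (again with hypothetical names; SQLite 
allows ALTER TABLE ... ADD COLUMN with a constant default, which existing rows 
pick up automatically):

    import sqlite3

    con = sqlite3.connect(":memory:")
    # Stand-in for the pristine table of an existing (old-format) working copy.
    con.execute("CREATE TABLE pristine (checksum TEXT PRIMARY KEY)")
    con.execute("INSERT INTO pristine VALUES ('some-sha1')")

    # The implicit upgrade: just add the three columns with their defaults.
    # length = -1 (or NULL) means "look at the plain pristine file on disk",
    # so none of the existing pristine files have to be touched.
    con.executescript("""
        ALTER TABLE pristine ADD COLUMN "offset"  INTEGER NOT NULL DEFAULT 0;
        ALTER TABLE pristine ADD COLUMN "length"  INTEGER NOT NULL DEFAULT -1;
        ALTER TABLE pristine ADD COLUMN algorithm INTEGER NOT NULL DEFAULT 0;
    """)

    print(con.execute("SELECT * FROM pristine").fetchone())
    # ('some-sha1', 0, -1, 0)  -- old entries keep working unchanged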

- "debuggability" is somehow given:
  - If the SQLite db is still intact, the pristines can be decompressed and 
split into pieces just with tools like zcat, head and tail.
  - Even if the db is borked, most file formats have some heuristical 
"begin/end" markers (#include-lines, JFIF-Header, etc.) which allows forensics 
to find the ofsett and size using hexdump.

- As most current decompressors for gz and lzma transparently support 
decompression of streams which were "first compressed, then concatenated", we 
could even try to exploit transfer encodings (like transparent gz compression 
in http) which might already deliver the files to us in compressed form (see 
the small demonstration below).
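
A small demonstration of that last point, using Python's gzip module on 
made-up data:

    import gzip

    # Two payloads compressed independently, then the compressed streams are
    # simply concatenated into one container.
    container = gzip.compress(b"first pristine\n") + gzip.compress(b"second pristine\n")

    # A standard gzip decompressor reads all members transparently and
    # yields the concatenation of both payloads.
    assert gzip.decompress(container) == b"first pristine\nsecond pristine\n"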

The disadvantage clearly is that we need a few more bytes when storing that 
metadata in the SQLite database instead of in our own file. But in my eyes, 
these few bytes do not outweigh the overhead of inventing our own metadata 
storage format, including correct synchronization, transaction safety etc., 
which are already provided reliably by SQLite.

Best regards

Markus Schaber

[1] Plus additional columns like ref_count, checksum etc., which are needed by 
svn, but are not of interest for this discussion.
-- 
___________________________
We software Automation.

3S-Smart Software Solutions GmbH
Markus Schaber | Developer
Memminger Str. 151 | 87439 Kempten | Germany | Tel. +49-831-54031-0 | Fax 
+49-831-54031-50

Email: m.scha...@3s-software.com | Web: http://www.3s-software.com 
CoDeSys internet forum: http://forum.3s-software.com
Download CoDeSys sample projects: 
http://www.3s-software.com/index.shtml?sample_projects

Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade 
register: Kempten HRB 6186 | Tax ID No.: DE 167014915 
