Hi,

In an attempt to reduce the footprint of pristine files in a working copy (WC), 
the compressed pristines (CP) feature has been proposed[1]. There has been 
discussion and debate regarding the particulars of the proposal. This is a 
summary of that discussion and a consolidation/checkpoint of the thread in 
preparation for moving forward. I tried to keep it short, but there are quite a 
few pertinent technical details that should be included for a balanced picture.

::Summary::

The design document[1] outlines the rationale behind the feature as well as its 
requirements. So far these haven't been challenged. To summarize them:

Pristine files currently incur almost 100%[2] overhead, both in terms of disk 
footprint and file count, in a given WC. Since pristine files are a design 
element of SVN, reducing their inherent overhead should be a welcome 
improvement to SVN from a user's perspective. Because source files tend to be 
small, the on-disk footprint of a pristine store (PS) is larger than the actual 
total byte count because of internal fragmentation (file-system block-size 
rounding waste) - see the references for numbers. The proposal takes this into 
consideration and proposes packing small files together into larger files to 
reduce this sub-block waste to a minimum. Packing pristine files introduces a 
new complication, however: efficiently adding and removing pristine files from 
a pack. The proposed solution was a minimalist custom file format designed to 
support SVN's requirements from the ground up. This custom file format has been 
challenged by many members of the community.

::Solutions::

Broadly speaking, there have been three proposed designs for the CP feature.

1) In-place compressed files (aka gzipped pristine files).
2) Custom pack files (split by a cutoff size so large files are practically 
compressed in-place).
3) SQLite database (for small files, plus files on disk for large ones).

::Observations::

There have been tests and simulations[3][4][5] that attempted to collect hard 
numbers on the three alternatives. The gist of all that can be summarized as 
follows:

a) Sub-block waste of 10-50% was observed on untouched SVN 1.7 checkouts[6].
b) Sub-block waste increases, percentage-wise, under in-place compression, 
limiting the potential savings[7] (a sketch of this arithmetic follows the 
list).
c) When in-place compressed files were combined (packed), a *further* 150-500% 
reduction was obtained[8].
d) Projects that have a very large number of files (50k+) tend to have a small 
average file size; hence they have the biggest potential to gain from reduced 
sub-block waste and faster file reads thanks to packing (fewer disk seeks and 
improved read-cache utilization).
e) SQLite is a solid offering that may be used to pack small files easily.
f) SQLite may suffer from external fragmentation (holes left in the file by 
deletions), which counteracts the savings by keeping the PS footprint larger 
than necessary.
g) SQLite is *not* suitable for large files (typically anything beyond several 
dozen KB).
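
To make (a) and (b) concrete, here is a minimal sketch of the block-rounding 
arithmetic behind the waste numbers. The 4 KB block size is an assumption for 
illustration; actual file-systems vary:

    #include <stdio.h>

    #define BLOCK_SIZE 4096ULL  /* assumed file-system block size */

    /* Bytes lost to block-size rounding for a file of `size` bytes. */
    static unsigned long long
    sub_block_waste(unsigned long long size)
    {
      unsigned long long rem = size % BLOCK_SIZE;
      return rem ? BLOCK_SIZE - rem : 0;
    }

    int main(void)
    {
      /* A 200-byte pristine occupies a whole 4096-byte block (~95% waste).
         Compressing it in place to 80 bytes frees nothing on disk, which
         is why in-place compression hits a floor (observation b). */
      printf("200 B file wastes %llu B\n", sub_block_waste(200));
      printf("10000 B file wastes %llu B\n", sub_block_waste(10000));
      return 0;
    }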

::Critique::

1) In-place compression has the flattest technical-challenge curve: it's 
readily available via SVN's compressed-streams API, backed by zlib. While this 
is a huge advantage, the reduction in disk space is limited. Tests show that 
even in the best-case scenario (a single case), 60% compression wasn't 
attainable; most cases didn't reach 50%, and others were limited to 25%. Packed 
compression improved these numbers by 150% in the first case and by more than 
500% at the other extreme. In addition, in-place compression will not reduce 
pressure on the file-system and can't exploit any of the advantages that the 
other two proposals do. A minor disadvantage that was voiced is that the 
pristine files remain transparent to users, which may encourage unsupported 
manipulation.
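
For reference, the mechanics of option 1 amount to little more than the 
following (a sketch against raw zlib's one-shot API rather than SVN's stream 
wrappers; error handling trimmed):

    #include <stdlib.h>
    #include <zlib.h>

    /* Compress `src_len` bytes of a pristine into a freshly allocated
       buffer; sets *dst_len and returns the buffer, or NULL on failure. */
    static unsigned char *
    gz_pristine(const unsigned char *src, size_t src_len, size_t *dst_len)
    {
      uLongf bound = compressBound(src_len);
      unsigned char *dst = malloc(bound);

      if (dst && compress2(dst, &bound, src, src_len,
                           Z_DEFAULT_COMPRESSION) != Z_OK)
        {
          free(dst);
          dst = NULL;
        }
      *dst_len = bound;
      return dst;
    }

The contention isn't with this mechanism but with its ceiling: whatever ratio 
compress2() achieves per file, the result is still rounded up to whole blocks 
on disk, and the file count is unchanged.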

2) A custom file format holds the promise of supporting an arbitrary set of 
requirements and is self-contained, extensible and flexible. The implementation 
may be improved incrementally to add new features and operations that yield 
better performance and/or disk savings. Once we have a custom format, we can 
sort and combine files before compression to exploit inter-file similarities. 
This has been shown to yield significant savings without adding too much 
complexity (see the references for numbers). The major critique of this 
approach is the overhead, risk and complexity associated with implementing and 
maintaining a release-grade custom file format. If the benefits are too 
doubtful to justify this overhead, then we should forgo this route.
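
To give a feel for what "minimalist" might mean here, a pack index could be as 
small as one fixed-size record per pristine. This layout is hypothetical, not 
the proposal's actual on-disk format:

    #include <stdint.h>

    /* Hypothetical index record for one pristine inside a pack file.
       The debated complexity lives around records like this: adding
       and removing pristines means maintaining the index plus the
       holes that deletions leave in the data area. */
    typedef struct pack_index_entry_t
    {
      uint8_t  sha1[20];         /* pristine checksum (lookup key) */
      uint64_t offset;           /* byte offset of the data in the pack */
      uint64_t compressed_size;  /* stored (compressed) length */
      uint64_t original_size;    /* expanded length, for verification */
    } pack_index_entry_t;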

3) SQLite has a very reputable code-base and performance record, in addition to 
already being used by SVN. Using it to store small pristine files has been 
proposed. The major criticism is the possibility of abusing and/or overloading 
SQLite with a kind of usage it probably isn't optimized to handle. In addition, 
if the database file grows too large, performance and disk usage may 
deteriorate and require costly maintenance passes to recover. SQLite will not 
handle large files, which would be stored separately on disk; this is similar 
to the custom-file-format case, where large files end up not sharing their pack 
file with other files.
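
For concreteness, the SQLite variant could start from something as simple as 
the schema and lookup below. The table and column names are made up for 
illustration, and the small/large cutoff remains the open question noted above:

    #include <sqlite3.h>

    /* Hypothetical schema: small pristines live as BLOBs keyed by
       checksum; anything above the cutoff stays on disk as today. */
    static const char *schema_sql =
      "CREATE TABLE IF NOT EXISTS pristine_blob ("
      " checksum TEXT PRIMARY KEY,"   /* hex SHA-1 */
      " content  BLOB NOT NULL)";

    /* Look up a small pristine and report its stored size.
       Returns 0 on success, -1 if missing or on error. */
    static int
    pristine_size(sqlite3 *db, const char *checksum, int *size)
    {
      sqlite3_stmt *stmt;
      int found = -1;

      if (sqlite3_prepare_v2(db,
            "SELECT length(content) FROM pristine_blob"
            " WHERE checksum = ?1", -1, &stmt, NULL) != SQLITE_OK)
        return -1;
      sqlite3_bind_text(stmt, 1, checksum, -1, SQLITE_STATIC);
      if (sqlite3_step(stmt) == SQLITE_ROW)
        {
          *size = sqlite3_column_int(stmt, 0);
          found = 0;
        }
      sqlite3_finalize(stmt);
      return found;
    }

Note that the external fragmentation in (f) corresponds to SQLite's free-page 
list: pages freed by DELETEs are reused for later inserts, but only a full 
VACUUM, which rewrites the database file, returns them to the file-system.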

::Conclusion::

Regardless of which approach we take, it will most probably feature two 
properties: incremental implementation and benchmarking. The simplest approach 
could be taken first, keeping in mind that we need to collect more data and 
perform rigorous testing to ensure stability and reliability. In-place 
compression, while very easy to implement, is very limited and will serve only 
smaller WCs with a large average file size. SQLite holds great promise that 
should be explored: where the cutoff for small files lies, what page size we 
should choose, and other SQLite configuration details that may affect 
performance are subject to further research. Creating a custom file format that 
is simple, yet powerful enough to give us flexibility, is still a very exciting 
possibility - one that should probably be explored, if for nothing else than 
research and comparison. SQLite can be used as a first implementation, and the 
simplest working pack-file implementation may be added for comparison and 
experimentation.

Thank you for reading. Please share your opinion, notes and concerns.

[1] https://docs.google.com/document/d/1ktIsewfMBMVBxbn-Ng8NwkNwAS_QJ6eC7GOygsbBeEc/edit
[2] Since pristines are stored using SHA-1 hash, identical files are stored 
only once.
[3] http://mail-archives.apache.org/mod_mbox/subversion-dev/201203.mbox/%3c1332695820.9245.yahoomail...@web161401.mail.bf1.yahoo.com%3E
[4] http://svn.haxx.se/dev/archive-2012-03/0573.shtml
[5] http://svn.haxx.se/dev/archive-2012-03/0578.shtml
[6] Tested trunks were SVN, GCC, WebKit, OOO and WP, as of late March 2012.
[7] A file's on-disk footprint only shrinks when compression pushes it across a 
block-size boundary; e.g., on a 4 KB-block file-system, a 1 KB file compressed 
to 300 bytes still occupies one full block. Therefore, files smaller than a 
single block can never be shrunk at all, even with the best compression 
algorithms.
[8] These savings are due to inter-file similarities, sorting (which exploits 
statistical similarities between same-type files) and avoiding sub-block waste. 
The latter has the biggest effect on WCs with small average file sizes, while 
sorting has the least effect, on the order of single-digit percentage 
improvements.

-Ash
