Hi Ashod.

1.  Filesystem compression.

Would you like to assess the 
feasibility of compressing the pristine store by re-mounting the 
"pristines" subdirectory as a compressed subtree in the operating 
system's file system?  This can be done (I believe) under Windows with 
NTFS <http://support.microsoft.com/kb/307987> and under Linux with 
FUSE-compress 
<http://code.google.com/p/fusecompress/>.  Certainly the 
trade-offs are different, compared with implementing compression inside 
Subversion, but delegating the task to a third-party 
subsystem could give us a huge advantage in terms of reducing the ongoing 
maintenance cost.
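
As a purely illustrative sketch (the directory path, and indeed the whole idea 
of shelling out to the stock Windows "compact" utility, are assumptions on my 
part rather than a worked-out proposal), a client-side helper along these 
lines could mark the pristines subtree compressed on NTFS:

    import subprocess
    import sys

    def compress_pristines_ntfs(pristines_dir):
        """Ask NTFS to compress everything under the pristines directory.

        Relies on the stock Windows 'compact' utility: /C enables
        compression, /S applies it recursively to the given directory.
        """
        if sys.platform != "win32":
            raise OSError("NTFS compression is only available on Windows")
        subprocess.run(["compact", "/C", "/S:" + pristines_dir], check=True)

    # Hypothetical usage on a working copy's pristine store:
    # compress_pristines_ntfs(r"C:\path\to\wc\.svn\pristine")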


2.  Uncompressed copies.

There has been a lot of discussion about achieving maximal compression by 
exploiting properties of similarity, ordering, and so on.  That is an 
interesting topic.  However, compression is not the only thing the pristine 
store needs to do.

The pristine store implementation also needs to provide *uncompressed* copies 
of the files.  Some of the API consumers can and should read the data through 
svn_stream_t; this is the easy part.  Other API consumers -- primarily those 
that invoke an external 'diff' tool -- need to be given access to a complete 
uncompressed file on disk.

At the moment, we just pass them the path to the file in the pristine store.  
When the pristine file is compressed, I imagine we will need to implement a 
cache of uncompressed copies of the pristine files.  The lifetimes of those 
uncompressed copies will need to be managed, and this may require some changes 
to the interface that is used to access them.  A typical problem is: user runs 
"svn diff", svn starts up a GUI diff tool and passes it two paths: the path to 
an uncompressed copy of a pristine file, and the path of a working-copy file.  
The GUI tool runs as a separate process and the "svn" process finishes.  Now 
the GUI diff is still running, accessing a file in our uncompressed-pristines 
cache.  How do we manage this so that we don't immediately delete the 
uncompressed file while the GUI diff is still displaying it, and yet also know 
when to clean up our cache later?
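
To make the lifetime problem concrete, here is a rough Python sketch of one 
possible policy, purely as an illustration and not a proposed design: 
uncompressed copies are published into a cache directory keyed by checksum, 
and a later invocation sweeps out copies that haven't been used for a grace 
period.  The class name, the gzip encoding of pristines and the age-based 
policy are all assumptions of the sketch.

    import gzip
    import os
    import shutil
    import tempfile
    import time

    class UncompressedPristineCache:
        """Hypothetical cache of uncompressed copies of compressed pristines.

        The process that creates a copy never deletes it (a GUI diff tool
        may outlive 'svn'); instead, a later invocation sweeps out copies
        that have not been used for a grace period.
        """

        def __init__(self, cache_dir, grace_period=24 * 3600):
            self.cache_dir = cache_dir
            self.grace_period = grace_period
            os.makedirs(cache_dir, exist_ok=True)

        def get_uncompressed_path(self, sha1, compressed_path):
            """Return a path to an uncompressed copy, creating it if needed."""
            out_path = os.path.join(self.cache_dir, sha1)
            if os.path.exists(out_path):
                os.utime(out_path)              # mark as recently used
                return out_path
            fd, tmp = tempfile.mkstemp(dir=self.cache_dir)
            with os.fdopen(fd, "wb") as dst, \
                 gzip.open(compressed_path, "rb") as src:
                shutil.copyfileobj(src, dst)
            os.replace(tmp, out_path)           # atomic publish into the cache
            return out_path

        def sweep(self):
            """Delete copies not used within the grace period."""
            now = time.time()
            for name in os.listdir(self.cache_dir):
                path = os.path.join(self.cache_dir, name)
                if now - os.path.getmtime(path) > self.grace_period:
                    os.remove(path)

The sweep could run at the start of a later svn command or during "svn 
cleanup", which sidesteps the need to know when the external diff tool exits.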

We could of course declare that the "pristine store" software layer is only 
responsible for providing streamy read access, and the management of 
uncompressed copies is the responsibility of higher level code.  But no matter 
where we draw the boundary, that functionality has to be designed and 
implemented before we can successfully use any kind of compression.

- Julian




>________________________________
> From: Ashod Nakashian <ashodnakash...@yahoo.com>
>
>In an attempt to reduce the footprint of pristine files in a working copy (WC), 
>the compressed pristines (CP) feature is proposed[1]. There have been 
>discussions and debates regarding the particulars of the proposal. This is a 
>summary of that discussion and a consolidation/checkpoint of the thread in 
>preparation to move forward. I tried to keep it short, but there is quite a 
>bit of technical detail that is pertinent and should be included for a 
>balanced picture.
>
>::Summary::
>
>The design document[1] outlines the rationale behind the feature as well as 
>its requirements. So far these haven't been challenged. To summarize them:
>
>Pristine files currently incur almost 100%[2] overhead both in terms of disk 
>footprint and file count in a given WC. Since pristine files are a design 
>element of SVN, reducing their inherent overhead should be a welcome 
>improvement to SVN from a user's perspective. Because source files tend to be 
>small, the footprint of a pristine store (PS) on disk is larger than the 
>actual total bytes because of internal fragmentation (file-system block-size 
>rounding waste) - see the references for numbers. The proposal takes this into 
>consideration and proposes packing small files together into larger files to 
>reduce this sub-block waste to a minimum. Packing pristine files introduces a 
>new complication, however: efficiently adding and removing pristine files 
>from a pack. The proposed solution was to introduce a minimalist custom file 
>format built to support SVN's requirements from the ground up. This custom 
>file format has been challenged by many members of the community.
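>
>To make the sub-block waste concrete, here is a rough Python sketch of the 
>kind of calculation behind such measurements (the 4096-byte block size and 
>the directory walk are assumptions for the example):
>
>    import os
>
>    def sub_block_waste(pristine_dir, block_size=4096):
>        """Total bytes lost to rounding each file up to whole blocks."""
>        total_bytes = total_on_disk = 0
>        for root, _dirs, files in os.walk(pristine_dir):
>            for name in files:
>                size = os.path.getsize(os.path.join(root, name))
>                # Each file occupies a whole number of blocks on disk.
>                blocks = (size + block_size - 1) // block_size
>                total_bytes += size
>                total_on_disk += blocks * block_size
>        return total_bytes, total_on_disk, total_on_disk - total_bytes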
>
>::Solutions::
>
>Broadly speaking, there have been three proposed designs for the CP feature.
>
>1) In-place compressed files (aka gzipped pristine files).
>2) Custom pack files (split by a cutoff size so large files are practically 
>compressed in-place).
>3) Sqlite database (for small files + files on disk for large ones).
>
>::Observations::
>
>There have been tests and simulations[3][4][5] that attempted to collect hard 
>numbers on the 3 alternatives. The gist of all that can be summarized as 
>follows:
>
>a) Sub-block waste of 10-50% was observed on untouched SVN1.7 checkouts[6].
>b) Sub-block waste increases, percentage-wise, with in-place compression, 
>limiting the potential savings[7].
>c) When in-place compressed files were combined (packed), a *further* 150-500% 
>reduction was obtained[8].
>d) Projects that have a very large number of files (50k+) tend to have a small 
>average size, hence they have the biggest potential to gain from reduced 
>sub-block waste and faster file-reads thanks to packing (reduced disk seeks 
>and improved read cache utilization).
>e) Sqlite has a solid offering that may be used to pack small files easily.
>f) Sqlite may suffer from external fragmentation (holes in the file that 
>aren't used due to deletions), which offsets the savings by keeping the 
>PS footprint artificially larger than necessary.
>g) Sqlite is *not* suitable for large files (typically anything beyond several 
>dozen KB).
>
>::Critique::
>
>1) In-place compression presents the fewest technical challenges. It's 
>readily available via the compressed streams API in SVN using Zlib. While this 
>is a huge advantage, the reduction in disk space is limited. Tests show that 
>even in the best-case scenario (one case only) 60% compression wasn't 
>attainable; most cases didn't hit 50%, while others were limited to 25%. 
>Packed compression improved these numbers by 150% for the first case and 
>exceeded 500% at the other extreme. In addition, in-place compression will not 
>reduce pressure on the file-system and can't exploit any of the advantages 
>that the other two proposals do. A minor disadvantage voiced is that the 
>pristine files remain transparent to the users and may potentially encourage 
>unsupported manipulations.
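>
>To illustrate the block-rounding limit on in-place compression (gzip standing 
>in for SVN's Zlib streams, and a 4096-byte block assumed for the example), a 
>file that compresses well can still occupy exactly as many blocks as before:
>
>    import gzip
>
>    def on_disk_size(size, block_size=4096):
>        """Round a byte count up to whole file-system blocks."""
>        return ((size + block_size - 1) // block_size) * block_size
>
>    def in_place_saving(path):
>        """On-disk bytes actually saved by gzip-compressing one pristine."""
>        data = open(path, "rb").read()
>        compressed = gzip.compress(data)
>        # A 3 KB file compressed to 1 KB still occupies one 4 KB block,
>        # so the on-disk saving is zero despite a 3:1 compression ratio.
>        return on_disk_size(len(data)) - on_disk_size(len(compressed))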
>
>2) A custom file format holds the promise of supporting an arbitrary number of 
>requirements while being self-contained, extensible and flexible. The 
>implementation may be incrementally improved to add new features and 
>operations that yield improved performance and/or disk savings. Once we have a 
>custom format, we can sort and combine files before compression to exploit 
>inter-file similarities. This is shown to yield significant savings without 
>adding too much complexity (see references for numbers). The major critique of 
>this approach is the overhead, risk and complexity associated with 
>implementing and maintaining a custom file format of release grade. If the 
>benefits are too doubtful to justify this overhead, then we should forgo 
>this route.
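>
>For illustration only, a toy pack writer along these lines (the layout, the 
>JSON index and the sort-by-extension heuristic are all made up for the 
>example; a real format would need far more care):
>
>    import json
>    import os
>    import zlib
>
>    def write_pack(paths, pack_path, index_path):
>        """Toy pack: sort by extension, concatenate, compress as one stream."""
>        index, chunks, offset = {}, [], 0
>        # Sorting by extension groups similar files so the compressor can
>        # exploit inter-file redundancy.
>        for path in sorted(paths, key=lambda p: os.path.splitext(p)[1]):
>            data = open(path, "rb").read()
>            index[os.path.basename(path)] = (offset, len(data))
>            chunks.append(data)
>            offset += len(data)
>        with open(pack_path, "wb") as pack:
>            pack.write(zlib.compress(b"".join(chunks)))
>        with open(index_path, "w") as idx:
>            json.dump(index, idx)
>
>(Reading one file back from such a pack means decompressing up to its offset, 
>and deleting a member means rewriting the pack; those are exactly the costs a 
>real format would have to address.)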
>
>3) Sqlite has a very reputable code-base and performance, in addition to 
>already being utilized by SVN. Using it for small pristine file storage has 
>been proposed. The major concern is the possibility of abusing and/or 
>overloading Sqlite with a kind of usage that it probably isn't optimized to 
>handle. In addition, if the database file grows too much, performance and disk 
>usage may deteriorate and require costly maintenance passes to improve them. 
>Sqlite will not handle large files, which will be stored separately on disk; 
>this is similar to the custom file format case, where large files end up not 
>sharing a pack file with other files.
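>
>As an illustrative sketch of this hybrid idea (the table layout, the 64 KB 
>cutoff and the function names are assumptions, not part of any proposal):
>
>    import os
>    import sqlite3
>
>    CUTOFF = 64 * 1024   # hypothetical "small file" threshold
>
>    def open_store(db_path):
>        db = sqlite3.connect(db_path)
>        db.execute("CREATE TABLE IF NOT EXISTS pristine "
>                   "(checksum TEXT PRIMARY KEY, data BLOB)")
>        return db
>
>    def store_pristine(db, big_file_dir, checksum, data):
>        """Small pristines go into SQLite; large ones stay as plain files."""
>        if len(data) <= CUTOFF:
>            db.execute("INSERT OR IGNORE INTO pristine VALUES (?, ?)",
>                       (checksum, data))
>            db.commit()
>        else:
>            with open(os.path.join(big_file_dir, checksum), "wb") as f:
>                f.write(data)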
>
>::Conclusion::
>
>Regardless of what approach we take, it'll most probably feature two 
>properties: incremental implementation and benchmarking. The simplest approach 
>could be taken first, keeping in mind that we need to collect more data and 
>perform rigorous testing to ensure stability and reliability. In-place 
>compression, while very easy to implement, is very limited and will serve only 
>the smaller WCs with a large average file-size. Sqlite holds great promise 
>that should be explored. Where the cutoff for small files lies, what page-size 
>we should choose, and other Sqlite configuration details that may affect 
>performance are subject to further research. Creating a custom file format 
>that is simple, yet powerful enough to give us flexibility, is still a very 
>exciting possibility - one that should probably be explored, if for nothing 
>other than research and comparison. Sqlite can be used as a first 
>implementation, and the simplest working pack-file implementation may be 
>added for comparison and experimentation.
>
>Thank you for reading. Please share your opinion, notes and concerns.
>
>[1] https://docs.google.com/document/d/1ktIsewfMBMVBxbn-Ng8NwkNwAS_QJ6eC7GOygsbBeEc/edit
>[2] Since pristines are stored using SHA-1 hash, identical files are stored 
>only once.
>[3] http://mail-archives.apache.org/mod_mbox/subversion-dev/201203.mbox/%3c1332695820.9245.yahoomail...@web161401.mail.bf1.yahoo.com%3E
>[4] http://svn.haxx.se/dev/archive-2012-03/0573.shtml
>[5] http://svn.haxx.se/dev/archive-2012-03/0578.shtml
>[6] Tested trunks were SVN, GCC, WebKit, OOO and WP late March 2012.
>[7] Files whose compressed size isn't smaller by a whole multiple of the block 
>size don't shrink on disk. Therefore, files smaller than a single block can 
>never be shrunk further, even with the best compression algorithms.
>[8] These savings are due to inter-file similarities, sorting (which exploits 
>statistical similarities between same-type files) and avoiding sub-block 
>waste. The latter has the biggest effect on WCs with small average file-sizes, 
>while sorting has the least effect, in the order of a single-digit percentage 
>improvement.
>
>-Ash
>
>
>
