I am working on a system to archive files for backup purposes.  The
current challenge is a Unix mailbox file of about 100 MB.  By its
nature, new mail is appended at the end of the file and most of it is
rapidly deleted, so the first 90%+ of the file is old mail that
largely doesn't change.  I put many successive versions of this file
into Git, and because of the enormous sections the versions have in
common, collectively they should compress very well.  However, I am
having trouble getting Git to give me good compression.

Here is a test scenario to illustrate the problem:

I have a test program that generates 50 successive versions of a test
file and commits them into an empty Git repository.  (I can provide
the program if you want to duplicate this.)  Each version is 100 MB of
lines of random text characters; the first 90% of every version is
identical, and the final 10% is unique to each version.  Thus the
files aggregate to 5 GB while containing only 590 MB of unique lines
altogether.
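
For concreteness, here is a minimal sketch of what the generator does
(assumed details; the real program differs, and GNU head/base64
options are used for brevity):

    git init .
    # ~90 MB shared prefix of random 64-byte text lines
    head -c 68M /dev/urandom | base64 -w 63 > common
    for i in $(seq 0 49); do
        cp common file_base
        # ~10 MB tail that is unique to this version
        head -c 8M /dev/urandom | base64 -w 63 >> file_base
        git add file_base
        git commit -q -m "Commit $i"
    done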

After committing the 50 versions, the repository uses 3.6 GB of disk
space.  That is plausible compression for random text if Git cannot
exploit the commonality between the versions in this situation, i.e.,
if each version is stored as an independently zlib-compressed blob.

    $ time ./test-generator
    Initialized empty Git repository in 
    [master (root-commit) 149fb8b] Commit 0
     1 files changed, 1638400 insertions(+), 0 deletions(-)
     create mode 100644 file_base
    [master fa677ee] Commit 1
     1 files changed, 163840 insertions(+), 163840 deletions(-)
    [master a168631] Commit 49
     1 files changed, 163840 insertions(+), 163840 deletions(-)

    real    11m4.603s
    user    10m6.343s
    sys     0m30.205s
    $ du -sh .git
    3.6G        .git

I used "git gc --aggressive", but it was unable to finish compressing
and did not reduce disk usage:

    $ git gc --aggressive
    Counting objects: 150, done.
    Delta compression using up to 2 threads.
    warning: suboptimal pack - out of memory
    fatal: Out of memory, malloc failed (tried to allocate 106496001 bytes)
    error: failed to run repack
    $ du -sh .git
    3.6G        .git
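
I have not yet tried it, but presumably the same memory cap could be
handed to "git gc" itself as one-shot configuration, since
pack.windowMemory is read by the underlying "git pack-objects":

    # Untested guess: bound the per-thread delta window memory
    git -c pack.windowMemory=1g gc --aggressive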

However, "git repack" did achieve the compression once I gave it
arguments to limit its memory usage.  At this moment I forget exactly
what I read that prompted the particular options below, although some
of them were copied from the options that "git gc" passes to its
subordinate "git repack".  With them, "git repack" reduced disk
consumption to the expected size:

    $ time git repack -d -l -f --depth=250 --window=250 -A --window-memory=1g
    Counting objects: 150, done.
    Delta compression using up to 2 threads.
    Compressing objects: 100% (100/100), done.
    Writing objects: 100% (150/150), done.
    Total 150 (delta 49), reused 0 (delta 0)

    real        15m29.263s
    user        14m54.896s
    sys     0m20.905s
    $ du -sh .git
    427M        .git
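
I believe the delta chains can be confirmed by inspecting the new
pack with standard commands (output elided here):

    git count-objects -v                    # loose vs. packed totals
    git verify-pack -v .git/objects/pack/pack-*.idx | head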

Based on all this, what is the best way to garbage collect such a
repository?  Are there ways to configure "git gc" to make it call "git
repack" with the needed arguments?  Or should I call "git gc" with
arguments to suppress the repack, and call "git repack" manually?
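
My best guess, which I have not verified, is persistent configuration
along these lines (key names from git-config(1)):

    # Make future "git gc --aggressive" runs use the same limits
    git config gc.aggressiveWindow 250
    git config gc.aggressiveDepth 250
    git config pack.windowMemory 1g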

