> From: Michael Mossey <michaelmos...@gmail.com>
> 
> I'm very new to git, and so far I'm using it more like a system to back up 
> files and transfer them from computer to computer, no branching or anything 
> complicated yet. In one directory tree, most of my files are binary files, 
> so git can't take advantage of storing only deltas and can't compress the 
> files much. My database size has skyrocketed in the past month as a result. 
> At the moment I really don't need versions of certain files or directories 
> older than, say, 2 weeks. Is there a way to delete old history by that 
> criteria (say anything before a specific date?).

If you're dealing with binary files whose contents change rapidly,
there aren't any good solutions.

I have some Git repositories which I use to track large parts of my
system.  One tracks all the "system" files.  I only do a commit once a
week or so, but since those files rarely change, the fact that many of
the files are binaries and don't compress well doesn't matter.

For my home directory, I keep a repository I call "time warp", which
is somewhat like Apple's Time Machine.  cron does a Git commit every
minute.  But every night I run a cron job to "prune" it by removing a
subset of the commits.  That requires recreating all of the commits,
because the parents of the commits change.
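
To illustrate why a commit has to be recreated when its parent changes, here is a minimal sketch (not part of time-warp-prune). It assumes a SHA-1 repository, where a commit's ID is the SHA-1 of a "commit <length>\0" header followed by the commit body. The helper names commit_id and commit_body, the author identity, and the parent IDs are made up for illustration; the tree ID is Git's well-known empty tree.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Compute the object ID Git would assign to a commit with the given body
# (SHA-1 repositories): the hash covers a "commit <length>\0" header
# followed by the body itself.
sub commit_id {
    my ($body) = @_;
    return sha1_hex('commit ' . length($body) . "\0" . $body);
}

# Build a commit body.  All values here are hypothetical.
sub commit_body {
    my ($parent) = @_;
    return "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
         . "parent $parent\n"
         . "author A U Thor <author\@example.com> 1400000000 +0000\n"
         . "committer A U Thor <author\@example.com> 1400000000 +0000\n"
         . "\n"
         . "snapshot\n";
}

# Same tree, same author, same message; only the parent differs.
my $id_a = commit_id(commit_body('1' x 40));
my $id_b = commit_id(commit_body('2' x 40));
print "parent 111...: $id_a\n";
print "parent 222...: $id_b\n";
print "IDs differ: ", ($id_a ne $id_b ? "yes" : "no"), "\n";
```

Because the parent line is part of the hashed body, dropping a commit from the middle of the chain changes the ID of every commit after it, even though the trees they point to are untouched.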

There is a "rate" R, and the goal is that for times that are X minutes
in the past, only keep commits that are spaced X/R minutes apart.  In
this case, I set "git config time-warp.rate 365", so commits from two
days ago are pruned to be about 8 minutes apart, while commits from a
year ago are pruned to be about 1 day apart.  In the long run, the
number of commits grows about as the logarithm of how long I've been
running time-warp.
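
The arithmetic behind those figures can be checked with a small sketch. The function name spacing_minutes and the sample ages are assumptions chosen for illustration; only the rate of 365 comes from the text above.

```perl
use strict;
use warnings;

# Target spacing (in minutes) between retained commits that are
# $age_minutes old, for a given rate: spacing = age / rate.
sub spacing_minutes {
    my ($age_minutes, $rate) = @_;
    return $age_minutes / $rate;
}

my $rate = 365;   # the time-warp.rate value quoted above
my %ages = (
    '1 hour'  => 60,
    '2 days'  => 2 * 24 * 60,
    '30 days' => 30 * 24 * 60,
    '1 year'  => 365 * 24 * 60,
);

# Print the ages from youngest to oldest.
for my $label (sort { $ages{$a} <=> $ages{$b} } keys %ages) {
    printf "%-8s old: keep about one commit every %.1f minutes\n",
        $label, spacing_minutes($ages{$label}, $rate);
}
```

For two days this gives 2880/365, a little under 8 minutes, and for one year 525600/365 = 1440 minutes, which is one day.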

Dale
----------------------------------------------------------------------
time-warp-prune

#! /bin/perl

use strict;

# Process the switches.

# $debug is the debugging print level.
my($debug) = 0;
# $do_gc is true if git-gc is to be done after pruning.
my($do_gc) = 1;
while (@ARGV && $ARGV[0] =~ /^-/) {
    # The -d switch must have a numeric argument.
    if ($ARGV[0] =~ /^-d([\d]+)$/) {
        $debug = $1;
        shift;
    }
    # The --no-gc switch suppresses the garbage collection afterward.
    elsif ($ARGV[0] eq '--no-gc') {
        $do_gc = 0;
        shift;
    }
    # Handle unknown options.
    else {
        die "Unknown option: '", $ARGV[0], "'\n"
    }
}
# Handle unknown arguments.
die "Unknown argument(s): ", join(' ', @ARGV) if $#ARGV >= 0;
if ($debug) {
    print STDERR "\$debug = $debug\n";
    print STDERR "\$do_gc = $do_gc\n";
}

# This is the rate at which commits are to be retained:
my($rate);
# At a time N in the past, commits should be spaced at most N/$rate
# apart.
# Thus, larger $rate values mean to keep more commits around.
# The rate is stored in the Git configuration as time-warp.rate.
# If the user entered it on the command line, it would be easier for
# the user to fumble-finger a small value and delete much of the
# history he wanted to save.
my($config_name) = 'time-warp.rate';
my($command) = 'git config ' . $config_name;
chomp($rate = `$command`);
my($r) = $? >> 8;
if ($r != 0) {
    warn "Could not obtain Git configuration value '$config_name'.\n";
    die "Error executing '$command': exit code $r\n" if $r;
} elsif (!($rate =~ /^\d+$/ && $rate >= 1)) {
    die "Rate value '$rate' is syntactically incorrect or less than 1.\n";
}
print STDERR "\$rate = $rate\n" if $debug;

# Get the hashes and times of the commit history.
# Note that we are assuming that the current branch of the repository
# is the branch to be operated upon.
$command = "git log --pretty=tformat:'%H %ct'";
print STDERR "\$command = $command\n" if $debug >= 3;
open(GIT, "-|", $command) ||
    die "Error executing '$command' for input: $!\n";
# Note that "git log" lists commits going back in time, so @hashes and @times
# will describe the latest commits first.
my(@hashes, @times);
while (<GIT>) {
    chomp;
    my($hash, $time) = split;
    push(@hashes, $hash);
    push(@times, $time);
    print STDERR "\$hashes[", $#hashes, "] = $hash, \$times[", $#times, "] = $time\n"
        if $debug >= 2;
}
close GIT || die "Error closing '$command': $!\n";

# Get the "now" time, which is the time of the last commit.
my($now) = $times[0];
print STDERR "\$now = $now\n" if $debug;

# Now, working from oldest to newest, look at each commit and decide whether
# to recreate it.
# The last commit we've recreated and its time.
my($last_commit) = '0000000000000000000000000000000000000000';
my($last_commit_time) = 0;
my($commits_created) = 0;
print STDERR "\$last_commit = $last_commit, \$last_commit_time = $last_commit_time\n"
    if $debug;
# Cycle through the commits from the oldest to the newest, recreating
# the commit chain, retaining the commits we desire.
for (my $i = $#hashes; $i >= 0; $i--) {
    print STDERR "\$i = $i, \$hashes[$i] = $hashes[$i], \$times[$i] = $times[$i]\n"
        if $debug;
    # Test if commit $i-1 (the next-newer commit than this one) is
    # close enough to $last_commit that we can omit creating a new
    # commit from this one, commit $i.  We always generate a new
    # commit from commit 0, which is the newest.
    if ($i > 0 && $debug) {
        print STDERR "\$times[", $i-1, "] = $times[$i-1], \$last_commit_time = $last_commit_time, \$now = $now\n";
        # Show both sides of the comparison made below.
        print STDERR $times[$i-1] - $last_commit_time, ' >= ',
            ($now - $times[$i-1]) / $rate, "\n";
    }
    if ($i == 0 ||
        $times[$i-1] - $last_commit_time >= ($now - $times[$i-1]) / $rate) {
        print STDERR "Recreate commit $i: $hashes[$i]\n" if $debug;
        # Commit $i-1 (the next-newer commit than this one) is too far from
        # the last new commit we created, so we have to create a new
        # commit out of commit $i.
        $last_commit = &create_new_commit($hashes[$i], $last_commit);
        $last_commit_time = $times[$i];
        $commits_created++;
        print STDERR "\$last_commit = $last_commit, \$last_commit_time = $last_commit_time\n"
            if $debug;
    }
}

# Set the HEAD to the new commit.
# Tell git to check that the current HEAD is $hashes[0] (what it was
# previously), so that if the repository has been modified during our
# execution, we can abort.
my(@command) = ('git', 'update-ref', 'HEAD', $last_commit, $hashes[0]);
system(@command);
$r = $? >> 8;
die "Error executing '" . join(' ', @command) . "': exit code $r\n" if $r;
print STDERR "system(" . join(' ', @command) . "): exit code $r\n" if $debug;

# Update the index to match HEAD.
# Specify -q, because "git reset" can incorrectly report a lot of files as
# unstaged changes.
@command = ('git', 'reset', '-q', 'HEAD');
system(@command);
$r = $? >> 8;
die "Error executing '" . join(' ', @command) . "': exit code $r\n" if $r;
print STDERR "system(" . join(' ', @command) . "): exit code $r\n" if $debug;

print $#hashes+1, " previous commits, ",
    $commits_created, " commits were recreated, ",
    $#hashes-$commits_created+1, " commits were dropped\n";

# Garbage collect the repository.
if ($do_gc) {
    print "Garbage collecting and compressing the repository.\n";
    # Use "git gc --aggressive", as the memory consumed by repacking can
    # be controlled by the pack.windowMemory configuration value.
    @command = ('git', 'gc', '--quiet', '--aggressive');
    system(@command);
    $r = $? >> 8;
    die "Error executing '" . join(' ', @command) . "': exit code $r\n" if $r;
    print STDERR "system(" . join(' ', @command) . "): exit code $r\n"
        if $debug;
}

# Print out total space usage.
@command = ('du', '-sh', '.git');
system(@command);
$r = $? >> 8;
die "Error executing '" . join(' ', @command) . "': exit code $r\n" if $r;
print STDERR "system(" . join(' ', @command) . "): exit code $r\n" if $debug;

exit 0;

# Create a new commit.
# Takes as input the hash of the commit to recreate and the hash to be
# used as the parent of the new commit.
# Returns the new commit hash.
sub create_new_commit {
    my($old_commit, $parent) = @_;
    print STDERR "\&create_new_commit('$old_commit', '$parent')\n"
        if $debug >= 3;
    print STDERR "\$old_commit = $old_commit, \$parent = $parent\n"
        if $debug >= 3;

    my($command) = "git cat-file -p $old_commit";
    print STDERR "\$command = $command\n" if $debug >= 3;
    open(GIT, "-|", $command) ||
        die "Error executing '$command' for input: $!\n";
    # Parse the header part of the commit object.
    my($tree, $author_name, $author_email, $author_date,
       $committer_name, $committer_email, $committer_date);
    while (<GIT>) {
        chomp;
        print STDERR "\$_ = $_\n" if $debug >= 3;
        if ($_ eq '') {
            last;
        } elsif (/^tree (.*)/) {
            $tree = $1;
        } elsif (/^author (.*) <([^ ]*)> ([-+\d\s]*)$/) {
            $author_name = $1;
            $author_email = $2;
            $author_date = $3;
        } elsif (/^committer (.*) <([^ ]*)> ([-+\d\s]*)$/) {
            $committer_name = $1;
            $committer_email = $2;
            $committer_date = $3;
        } else {
            ;
        }
    }
    # The remainder of the commit object is the commit message.
    my(@commit_message) = <GIT>;
    # Each element of @commit_message already ends with a newline.
    print STDERR "\@commit_message = '", join('', @commit_message) ,"'\n"
        if $debug >= 3;
    close GIT || die "Error closing '$command': $!\n";
            
    # Set up the environment for the author and committer information.
    $ENV{'GIT_AUTHOR_NAME'} = $author_name;
    print STDERR "GIT_AUTHOR_NAME = '$ENV{'GIT_AUTHOR_NAME'}'\n"
        if $debug >= 3;
    $ENV{'GIT_AUTHOR_EMAIL'} = $author_email;
    print STDERR "GIT_AUTHOR_EMAIL = '$ENV{'GIT_AUTHOR_EMAIL'}'\n"
        if $debug >= 3;
    $ENV{'GIT_AUTHOR_DATE'} = $author_date;
    print STDERR "GIT_AUTHOR_DATE = '$ENV{'GIT_AUTHOR_DATE'}'\n"
        if $debug >= 3;
    $ENV{'GIT_COMMITTER_NAME'} = $committer_name;
    print STDERR "GIT_COMMITTER_NAME = '$ENV{'GIT_COMMITTER_NAME'}'\n"
        if $debug >= 3;
    $ENV{'GIT_COMMITTER_EMAIL'} = $committer_email;
    print STDERR "GIT_COMMITTER_EMAIL = '$ENV{'GIT_COMMITTER_EMAIL'}'\n"
        if $debug >= 3;
    $ENV{'GIT_COMMITTER_DATE'} = $committer_date;
    print STDERR "GIT_COMMITTER_DATE = '$ENV{'GIT_COMMITTER_DATE'}'\n"
        if $debug >= 3;
    my(@command) = ("git", "commit-tree", $tree);
    push(@command, "-p", $parent)
        unless $parent eq '0000000000000000000000000000000000000000';
    print STDERR "\@command = '", join(' ', @command), "'\n" if $debug >= 3;

    # Set up to handle both stdin and stdout of the command.
    pipe(PARENT_RDR, CHILD_WTR);
    pipe(CHILD_RDR, PARENT_WTR);

    my($new_commit);
    my($pid) = fork();
    print STDERR "\$pid = '$pid'\n" if $debug >= 3;
    if (!defined($pid)) {
        # Error forking.
        die "Cannot fork to create subprocess: $!";
    } elsif ($pid) {
        print STDERR "Parent.\n" if $debug >= 3;
        # This is the parent process.
        close CHILD_RDR;
        close CHILD_WTR;
        # Write the commit message to the subprocess.
        print PARENT_WTR join('', @commit_message);
        close PARENT_WTR;
        # Read the new commit hash from the subprocess.
        chomp($new_commit = <PARENT_RDR>);
        close PARENT_RDR;
        # Wait for the subprocess to terminate.
        waitpid($pid, 0);
        # Check its exit status.
        my($r) = $? >> 8;
        print STDERR "\$r = $r\n" if $debug >= 3;
        die "Error executing '" . join(' ', @command) . "': exit code $r\n"
            if $r;
    } else {
        print STDERR "Child.\n" if $debug >= 3;
        close PARENT_RDR;
        close PARENT_WTR;
        open(STDIN, "<&CHILD_RDR");
        open(STDOUT, ">&CHILD_WTR");
        exec(@command) || die "Error exec('", join(' ', @command), "'): $!\n"
    }
    print STDERR "\$new_commit = '$new_commit'\n" if $debug >= 3;

    # Return the new commit hash.
    return $new_commit;
}
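
As a way to try the retention rule without touching a repository, the selection loop can be sketched as a standalone function and run on synthetic timestamps. This is an illustration under stated assumptions, not a replacement for the script above: the name select_kept, the base timestamp, and the one-commit-per-minute history are all hypothetical.

```perl
use strict;
use warnings;

# Given commit times in seconds, sorted newest first (as "git log" lists
# them), and a rate, return the indexes of the commits the rule would keep.
sub select_kept {
    my ($times, $rate) = @_;
    my $now = $times->[0];
    my $last_kept_time = 0;
    my @kept;
    for (my $i = $#$times; $i >= 0; $i--) {
        # Keep commit $i if skipping it would leave the next-newer commit
        # too far from the last commit kept.  Always keep the newest.
        if ($i == 0
            || $times->[$i-1] - $last_kept_time
                   >= ($now - $times->[$i-1]) / $rate) {
            push @kept, $i;
            $last_kept_time = $times->[$i];
        }
    }
    return @kept;
}

# Synthetic history: one commit per minute for a number of days.
my $base = 1_400_000_000;   # arbitrary starting timestamp
for my $days (1, 10, 100) {
    my $n = $days * 24 * 60;
    my @times = reverse map { $base + 60 * $_ } 0 .. $n - 1;
    my @kept = select_kept(\@times, 365);
    printf "%3d days: %6d commits, %5d kept\n", $days, $n, scalar @kept;
}
```

In the idealized model, where retained commits of age X are spaced exactly X/R apart, the number retained between ages a and b is about R * ln(b/a), so each tenfold increase in history length should add roughly the same number of kept commits.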
