At Thu, 23 May 2013 07:09:17 -0400, Eli Barzilay wrote:
> "Relevant history" is vague.

The history I want corresponds to `git log --follow' on each of the
files that end up in a repository.

> The thing that you can't do with
> filter-branch is keep the complete history if you remove files from
> the history -- the files that are gone go with their history.

That's true if you use `git filter-branch' with one fixed set of files
to keep. I'll suggest an alternative, which is to filter the set of
files in a commit-specific way. That is, the right set of files to keep
at each commit is not the set that ends up in the final tree, but the
set whose history we need at that commit.


To make sure I'm not confused, I've implemented this idea. My
implementation is probably not exactly right yet, but I think it
works as a proof of concept.


The enclosed "slice.rkt" script takes a subdirectory and a destination
directory. Run it in the top directory of a git repository, and it
finds all the files in the given subdirectory, and then it closes over
the history of each file via `git log --follow'.
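To illustrate what "closing over the history" means, here is a rough
Python sketch (not the enclosed "slice.rkt", which is written in Racket).
It assumes the text it receives has the shape of `git log --follow
--name-status --format=%H -- <path>' output, where renames and copies
appear as tab-separated "R<score>" or "C<score>" lines with the old and
new path. The function name and the sample data are made up for the
example.

```python
def paths_in_history(name_status_log: str) -> set[str]:
    """Collect every path named in name-status output for one file.

    Assumed input: lines such as "M\tpath", "A\tpath", or
    "R100\told\tnew" / "C075\tsrc\tdst", mixed with commit ids and
    blank lines, which are skipped.
    """
    paths: set[str] = set()
    for line in name_status_log.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # a commit id or a blank line
        kind = fields[0][0]
        if kind in "RC" and len(fields) == 3:
            # A rename or copy: both the old and the new path matter.
            paths.update(fields[1:])
        elif kind in "AMD":
            paths.add(fields[1])
    return paths


# Hypothetical log for a file that was moved from web/ to ex/.
SAMPLE = (
    "c3\n"
    "M\tex/internal/a.rkt\n"
    "\n"
    "c2\n"
    "R100\tweb/internal/a.rkt\tex/internal/a.rkt\n"
    "\n"
    "c1\n"
    "A\tweb/internal/a.rkt\n"
)

if __name__ == "__main__":
    print(sorted(paths_in_history(SAMPLE)))
```

Taking the union of these sets over all files in the subdirectory gives
the set of paths whose history has to be preserved.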

From that point, we could use the computed set of paths as the ones to
keep during a `git filter-branch' on every commit, but that's not
ideal. For example, a file in collection "a" that is destined for
package "a" may have originated in "b" (think "mzlib"), where the
same-named file sticks around in "b" after the copy. It's nicer and
cleaner to have irrelevant files disappear after the relevant copy/move
is made.

So, I took one more step: "slice.rkt" constructs a range of commits
during which the file should exist, based on when it was moved or
copied. (Forks and merges are a minor obstacle, which the script works
around by enlarging ranges to hit commits in the `--first-parent'
traversal.) Conceptually, the result is a mapping from commit ids to
paths, but that would be a big table to read on every `filter-branch'
step, so it's reported instead as a table of commits with enter/leave
transitions. The output of "slice.rkt" is two files: "state.rktd" holds
the set of files to keep in the initial commit, and "actions.rktd"
specifies the transitions.
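As a sketch of how per-file commit ranges could be turned into an
initial state plus enter/leave transitions, consider the following
Python. It is only an illustration under assumptions: the real file
formats of "state.rktd" and "actions.rktd" are not shown here, and the
range convention (a path is kept from its `enter' commit up to, but not
including, its `leave' commit) is chosen for the example.

```python
from collections import defaultdict


def build_actions(first_commit, ranges):
    """Turn {path: (enter, leave)} into (initial_state, actions).

    `leave` may be None, meaning the path is kept to the end.
    Paths that enter at `first_commit` go into the initial state;
    every other boundary becomes an enter or leave transition keyed
    by commit id.
    """
    state = sorted(p for p, (enter, _) in ranges.items()
                   if enter == first_commit)
    actions = defaultdict(lambda: {"enter": [], "leave": []})
    for path, (enter, leave) in sorted(ranges.items()):
        if enter != first_commit:
            actions[enter]["enter"].append(path)
        if leave is not None:
            actions[leave]["leave"].append(path)
    return state, dict(actions)


# Hypothetical example: "list.rkt" starts in "b" and is copied to "a"
# at commit c3, after which the copy in "b" is no longer relevant.
RANGES = {
    "b/list.rkt": ("c1", "c3"),
    "a/list.rkt": ("c3", None),
}

if __name__ == "__main__":
    print(build_actions("c1", RANGES))
```

The table has one entry per commit where something changes, instead of
one entry per commit in the whole history.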

The enclosed "prune.rkt" script works with `git filter-branch
--index-filter'. It uses "actions.rktd" (read-only) and "state.rktd"
(which it updates via transitions).
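The per-commit step can be sketched as a pure function in Python. This
is not the enclosed "prune.rkt"; the data shapes are assumptions made
for the example. In an actual `--index-filter', the commit id would
come from the GIT_COMMIT environment variable that `git filter-branch'
sets, the index listing from something like `git ls-files', and the
removal from something like `git rm --cached'.

```python
def prune_step(state, actions, commit, index_paths):
    """Apply this commit's transitions, then pick index entries to drop.

    state:       set of paths kept as of the previous commit
    actions:     {commit id: {"enter": [...], "leave": [...]}}
    index_paths: paths present in the index for `commit`
    Returns (new_state, paths_to_remove). If a path both enters and
    leaves at the same commit, this sketch lets "leave" win.
    """
    change = actions.get(commit, {})
    new_state = set(state) | set(change.get("enter", []))
    new_state -= set(change.get("leave", []))
    to_remove = sorted(p for p in index_paths if p not in new_state)
    return new_state, to_remove


if __name__ == "__main__":
    # Hypothetical data: at c3 the file moves from "b" to "a".
    actions = {"c3": {"enter": ["a/list.rkt"], "leave": ["b/list.rkt"]}}
    index = ["README", "a/list.rkt", "b/list.rkt"]
    print(prune_step({"b/list.rkt"}, actions, "c3", index))
```

Because only the state and the current commit's transitions are
consulted, each step stays cheap even when the history is long.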


The Racket git repo is large, so I've only tried the `git
filter-branch' step so far on smaller repos, such as the "iplt"
repository. In my clone of "iplt", I `git mv'ed "web/internal" to
"ex/internal". Then, with the scripts in "/tmp",

 racket /tmp/slice.rkt ex /tmp
 git filter-branch --index-filter "racket /tmp/prune.rkt /tmp" --prune-empty

leaves the repo with only the files of "ex", and `git log --follow'
on various files looks right.

I'll try on a clone of the Racket repo and report back.

FWIW, before doing this for real, I'd want to add a `--msg-filter' that
extends each commit message to add the original commit id, since we
have references to the old ids in various places (and so it would be
handy to have them in the new repos).
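A message filter along those lines might look like the following Python
sketch. `git filter-branch --msg-filter' passes the original message on
standard input and uses standard output as the new message, with
GIT_COMMIT set to the id of the commit being rewritten. The wording of
the added line ("original commit: ...") is just a placeholder, not a
settled format.

```python
import os
import sys


def add_original_id(message: str, commit_id: str) -> str:
    """Append a line naming the original commit id to a commit message."""
    body = message.rstrip("\n")
    return f"{body}\n\noriginal commit: {commit_id}\n"


if __name__ == "__main__":
    # When run as a --msg-filter, the message arrives on stdin and
    # GIT_COMMIT holds the id of the commit being rewritten.
    sys.stdout.write(add_original_id(sys.stdin.read(),
                                     os.environ["GIT_COMMIT"]))
```

Saved under a hypothetical name such as "/tmp/addid.py", it could be
passed as `--msg-filter "python3 /tmp/addid.py"'.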

Attachment: slice.rkt

Attachment: prune.rkt

_________________________
  Racket Developers list:
  http://lists.racket-lang.org/dev
