On Sun, 2017-09-17 at 08:36 +0100, Ian Campbell wrote:
> +if test -n "$state_branch"
> +then
> +	echo "Saving rewrite state to $state_branch" 1>&2
> +	state_blob=$(
> +		perl -e'opendir D, "../map" or die;
> +			open H, "|-", "git hash-object -w --stdin" or die;
> +			foreach (sort readdir(D)) {
> +				next if m/^\.\.?$/;
> +				open F, "<../map/$_" or die;
> +				chomp($f = <F>);
> +				print H "$_:$f\n" or die;
> +			}
> +			close(H) or die;' || die "Unable to save state")
One thing I've noticed is that for a full Linux tree history the
filter.map file is 50M+, which causes GitHub to complain:

    remote: warning: File filter.map is 54.40 MB; this is larger than GitHub's
    recommended maximum file size of 50.00 MB

(You can simulate this with `git log --pretty=format:"%H:%H"
upstream/master`.) I suppose that's not a bad recommendation for any
infra, not just GH's.
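
For a rough local figure without running a filter (assuming an
`upstream` remote pointing at Linus' tree, as above):

    git log --pretty=format:"%H:%H" upstream/master | wc -c

Each rewritten commit contributes 82 bytes to the map (two 40-hex
hashes plus the ':' and newline), so the size scales directly with the
number of commits.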
The blob is compressed in the object store, so there isn't _much_
point in compressing the map (it only goes down to ~30MB anyway, so we
wouldn't be buying much time before hitting the limit again). I'm
wondering, though, whether I should look into a more intelligent
representation: fan out by the first two characters of the hash, as
.git/objects does, so the map is split into several blobs across two
levels.
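
Something along these lines is what I have in mind (rough and
untested, assuming the same cwd as the snippet above, so the map is at
../map): one blob per two-hex-char bucket, plus a top-level index blob
mapping bucket to blob id:

    for prefix in $(ls ../map | cut -c1-2 | sort -u)
    do
        # one blob per bucket, same "old:new" lines as today
        blob=$(
            for f in ../map/"$prefix"*
            do
                printf '%s:%s\n' "${f##*/}" "$(cat "$f")"
            done | git hash-object -w --stdin
        )
        # index line: which blob holds which bucket
        printf '%s:%s\n' "$prefix" "$blob"
    done | git hash-object -w --stdin   # prints the id of the index blob

With 256 buckets each bucket blob would come out at roughly 200kB for
the current ~54MB map, comfortably under the warning threshold, at the
cost of a two-step lookup when loading the state back.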
I'm also wondering if the .git-rewrite/map directory, which will have
70k+ (and growing) directory entries for a modern Linux tree, would
benefit from the same sort of fan-out. OTOH in that case the extra
shell machinations to turn abcdef123 into ab/cdef123 might overwhelm
the savings in directory lookup time (unless there is already a helper
for that). That also assumes directory lookup is even a bottleneck;
I've not measured, but anecdotally / gut feeling, the
commits-per-second does seem to decrease over the course of the
filtering process.
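
FWIW the ab/cdef... split itself needn't fork anything; something like
this (a hypothetical helper, not in filter-branch today) does it with
plain parameter expansion:

    # map an old sha1 to a fanned-out path like ../map/ab/cdef...
    map_path () {
        rest=${1#??}                 # everything after the first two chars
        echo "../map/${1%"$rest"}/$rest"
    }

so the per-lookup overhead would be tiny either way; whether directory
lookup is actually the bottleneck is still the open question.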
Ian.