On Wed, Oct 10, 2018 at 09:48:53AM +0900, Junio C Hamano wrote:

> Jeff King <p...@peff.net> writes:
> 
> > The one difference is the sort order: git's diff output is
> > in tree-sort order, so a subtree "foo" sorts like "foo/",
> > which is after "foo.bar". Whereas the bitmap path list has a
> > true byte sort, which puts "foo.bar" after "foo".
> 
> If we truly cared, it is easy enough to fix by having a custom
> comparison function in 1/3 used in collect_paths() phase.

Yep. I thought about doing it just so I could drop this "one difference"
note, but I got lazy.

Running this on linux.git, I do see a few other differences. It looks
like my code does actually compute lists of touched paths for some
merges (presumably using "-c"). That wasn't intended, and it would
actually make my timings less good, but my goal was just to get a rough
idea on size here (but see below).

> > +   /* dump it while we have the sorted order in memory */
> > +   for (i = 0; i < n; i++) {
> > +           printf("%s", sorted[i]->path);
> > +           putchar('\0');
> > +   }
> 
> With printf("%s%c", sorted[i]->path, '\0'); you can lose the braces.

Heh, I didn't really expect review at that level. I'm not even sure this
is a good direction to go versus something like the bloom filters (or
even a more full --raw cache). But if it is, this code is mostly
throw-away anyway, as we'd want to integrate it with the actual diff
code.

My original goal had mostly been to get an idea of the size, and the
"dump" half was there to verify that the results were roughly sane. But
it actually works for rough timing, too. I can generate roughly the same
results as "rev-list --all | diff-tree --stdin -t --name-only" in about
300ms, as opposed to 33s. So that's good.

But it's also a slight cheat, since I'm not actually traversing the
commits, but rather just opening up the bitmaps in the order we wrote
them. ;)

Actually walking the commits (and not looking at the trees) takes ~7s,
so it would at least be more like 33s versus 7.3s. With core.commitgraph,
it's more like 1.1s, so imagine 27s versus 1.4s, I guess.

That's also neglecting any load/lookup time for actual random-access to
the bitmaps. I doubt that's more than a few hundred ms, but that's just
a made-up number.

So I think the rough timings are favorable, but the real proof would
actually be using it from a revision walk, which I haven't written.

> > +   putchar('\0');
> > +
> >     free(sorted);
> >  }
> >  
> > @@ -142,6 +150,8 @@ static void generate_bitmap(struct diff_queue_struct *q,
> >  
> >     ewah = bitmap_to_ewah(bitmap);
> >     ewah_serialize_strbuf(ewah, &out);
> > +
> > +   fwrite(data->commit->object.oid.hash, 1, GIT_SHA1_RAWSZ, stdout);
> >     fwrite(out.buf, 1, out.len, stdout);
> 
> OK, so per commit, we have ewah bitmap that records the "changed
> paths" after the commit object name.  Makes sense.

Yeah. This format, btw, is garbage. It was just the smallest and
simplest thing I could think of that would work for my case. We'd want
random-access to the bitmaps for each commit, probably via an index
block in the commit-graph file.

> And the list of paths are based on the "one" side of the filepair.
> When we do an equivalent of "git show X", we see "diff-tree X~1 X"
> and by collecting the "one" side (i.e. subset of paths in the tree
> of X~1 that were modified when going to X) we say "commit X changed
> these paths".  Makes sense, too.

I didn't think too hard on whether we'd need to look at the "two" side
ever. I turned off renames, so we'd see deletions via the "one". I feel
like we'd miss additions in that case, though, but from my results, we
do not seem to.

-Peff

Reply via email to