Re: [PATCH 1/2] blame: large-scale performance rewrite

Shawn Pearce Sun, 27 Apr 2014 10:54:07 -0700

On Sat, Apr 26, 2014 at 2:39 PM, David Kastrup <d...@gnu.org> wrote:
>
> At least the stuff I fixed with regard to performance would seem to be
> done right in JGit to start with.


Hah! Its Java. We have to do things right, otherwise its too slow. :-)

>> Its still not as fast as I want it to be. :-)
>
> Most of the diff data/CRC is computed over and over because of the
> blackbox use of xdiff.

Yes, I have been thinking about this all week.

JGit blame uses HistogramDiff by default instead of MyersDiff. The
first stage after we trim common header/trailer from both files is to
compute a hash of each line and store those hashes. Those hashes are
discarded as the blame algorithm moves to the next commit.

Clearly for a commit chain of A -> B -> C, the hashes computed at B
for the A->B compare can be reused for the B->C compare. This is not
the case in either git-core or JGit, because the diff algorithm is a
block box to the blame algorithm. I think this is what you mean by the
CRC being computed again.


For any given compare blame has a list of regions it is interested in
learning about from the diff algorithm. Anything outside of those
regions is useless noise that will be discarded. I have been pondering
pushing that region list down into the diff algorithm so it can avoid
executing on sections that are not relevant to the caller. At least
for HistogramDiff this makes some sense, the algorithm is recursively
applied after it finds a longest common subsequence. If one side of
the LCS is outside of the region of interest from blame, there is no
value in recursing on that portion.

If the blame region list covers a small enough portion, it may even
make sense to avoid the common header/trailer elimination
preprocessing step. Unfortunately that sounds "hard" as you could be
working with a file like a ChangeLog which grows on one of those
sides.


>  And then the delta-chain storage is packing
> stuff based on CRCs as well (not sure whether it keeps them around for
> unpacking).

There are CRCs validated by libz during inflation, but these aren't
rechecked once the inflated bytes are cached in that silly undersized
16M delta base cache.

> So there is a lot that could likely be improved while
> keeping the same basic algorithms, by cracking open the black boxes of
> the xdiff engine and the delta-chain coding.

The delta chain coding has no relationship to the source file.
Currently even plain text files are delta chain coded on a byte basis,
not a line basis. Just matching up the delta coding against a source
text file to determine lines 0-N are common is costly, since you have
a byte range in the delta coding and you want a line range out the
end.

To make things more challenging, the delta chain coding can be against
completely different blobs. In a compare of A->B from the commit graph
being walked by blame there is no requirement the delta coding uses
this pairing, and it almost certainly never uses this direction (its
usually B->A, if its even this pair!).


Given your comments in the other patch, I understand why you probably
won't be working on blame more. But the above may help someone else
that has available time to continue.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] blame: large-scale performance rewrite

Reply via email to