On Wed, Jul 09, 2014 at 08:51:07AM -0700, Junio C Hamano wrote:
> > The delta heuristics in pack-objects use pack_name_hash, which claims:
> > /*
> > * This effectively just creates a sortable number from the
> > * last sixteen non-whitespace characters. Last characters
> > * count "most", so things that end in ".c" sort together.
> > */
> > which might be another option (and seems like a superset of the basename
> > check, short of basenames that are longer than 16 characters).
> I am however not sure if the code to compute similarity score is as
> OK with false positives, i.e. dissimilar names that happen to hash
> together getting clumped in a same bin or in close bins, as the
> existing callers of pack_name_hash().
I think the hash here does not collide in that way. It really is just
the last sixteen characters shoved into a uint32_t.
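
For reference, here is a minimal sketch of that kind of hash, written from the comment quoted above rather than copied from the tree. The name sketch_name_hash() is made up for illustration, and using isspace() from <ctype.h> is an assumption. Each new character lands in the top byte and everything older is shifted down two bits, so a character has dropped out of the 32-bit value after roughly sixteen steps.

```c
#include <ctype.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical sketch of a "last sixteen characters" name hash.
 * Later characters occupy higher bits, so names with the same
 * ending (e.g. ".c") get numerically close values.
 */
static uint32_t sketch_name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (name == NULL)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		/* whitespace does not contribute to the hash */
		if (isspace((int)c))
			continue;
		/* age older characters by two bits, put the new one on top */
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

With this sketch, "a.c" and "b.c" differ only in the contribution of their first character (bit 20 and up), while the trailing ".c" dominates the value. That is what makes it a sort key rather than a general-purpose hash.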
But thinking on it more, that is useful to the delta code because it
wants to create a sorted list of items. In the rename code we are doing
pairwise comparisons, so we are more flexible. We can compare whole
basenames, or whole suffixes (so "a/foo/bar.c" is closer to
"b/foo/bar.c" than to "c/other/bar.c"). Or just use a general-purpose
The tricky part is that the rename detection seems to take the score as
a binary 0/1 "is it the same", but we would want to express more nuance
(i.e., the "best" match among those that have similar content scores).
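
To make the pairwise idea concrete, here is a rough sketch of one possible name-closeness measure: the number of whole trailing path components two paths share. The function shared_trailing_components() is hypothetical and is not something the rename code has today. How such a value would be folded into the existing score is left open here.

```c
#include <string.h>

/*
 * Hypothetical helper: count how many whole trailing path
 * components (separated by '/') two paths have in common.
 */
static unsigned shared_trailing_components(const char *a, const char *b)
{
	size_t ia = strlen(a), ib = strlen(b);
	unsigned count = 0;

	/* empty paths share nothing */
	if (ia == 0 || ib == 0)
		return 0;

	/* walk backwards while the characters agree */
	while (ia > 0 && ib > 0 && a[ia - 1] == b[ib - 1]) {
		ia--;
		ib--;
		/* a matching '/' closes off one fully shared component */
		if (a[ia] == '/')
			count++;
	}

	/*
	 * If one path ran out, its first component is shared only when
	 * the other path also ends there or has a '/' just before it.
	 */
	if ((ia == 0 && (ib == 0 || b[ib - 1] == '/')) ||
	    (ib == 0 && a[ia - 1] == '/'))
		count++;

	return count;
}
```

Under this measure "a/foo/bar.c" shares two trailing components with "b/foo/bar.c" but only one with "c/other/bar.c", so it could act as a tie-breaker among candidates whose content scores are similar.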