On Wed, Jul 09, 2014 at 08:51:07AM -0700, Junio C Hamano wrote:

> > The delta heuristics in pack-objects use pack_name_hash, which claims:
> >
> >         /*
> >          * This effectively just creates a sortable number from the
> >          * last sixteen non-whitespace characters. Last characters
> >          * count "most", so things that end in ".c" sort together.
> >          */
> >
> > which might be another option (and seems like a superset of the basename
> > check, short of basenames that are longer than 16 characters).
>
> Perhaps.
> I am however not sure if the code to compute similarity score is as
> OK with false positives, i.e. dissimilar names that happen to hash
> together getting clumped in a same bin or in close bins, as the
> existing callers of pack_name_hash().

I think the hash here does not collide in that way. It really is just
the last sixteen characters shoved into a uint32_t.
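
For reference, the function is roughly the following. This is a
standalone sketch written from memory, not copied from the tree; the
cast to unsigned char and the use of the C library's isspace() are my
assumptions to keep it self-contained.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of the name hash: each non-whitespace character lands in the
 * top byte and everything seen earlier is shifted down two bits, so
 * after about sixteen further characters an old one has effectively
 * dropped out of the 32-bit value.
 */
static uint32_t pack_name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace((int)c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

Because the last character contributes the most significant bits,
sorting by this value groups names by their tails, which is what the
delta code wants.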

But thinking on it more, that is useful to the delta code because it
wants to create a sorted list of items. In the rename code we are doing
pairwise comparisons, so we are more flexible. We can compare whole
basenames, or whole suffixes (so "a/foo/bar.c" is closer to
"b/foo/bar.c" than to "c/other/bar.c"). Or just use a general-purpose
edit-distance function.
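
To make the suffix idea concrete, here is a minimal sketch of a
pairwise name score. common_suffix_len() is a hypothetical helper, not
anything in git today; it just counts how many trailing bytes two paths
share, so a longer shared tail means a closer name.

```c
#include <string.h>

/*
 * Hypothetical helper: number of trailing bytes that paths "a" and "b"
 * have in common. Compares from the end of each string backwards and
 * stops at the first mismatch or when either string is exhausted.
 */
static size_t common_suffix_len(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b), n = 0;

	while (n < la && n < lb && a[la - 1 - n] == b[lb - 1 - n])
		n++;
	return n;
}
```

With this, "a/foo/bar.c" shares the ten bytes "/foo/bar.c" with
"b/foo/bar.c" but only the six bytes "/bar.c" with "c/other/bar.c". A
real implementation might prefer to count whole path components rather
than bytes; this is only meant to show the shape of the comparison.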

The tricky part is that the rename detection seems to treat the name
score as a binary 0/1 "is it the same" flag, but we would want to
express more nuance (i.e., pick the "best" name match among those that
have similar content scores).
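
Something like the following is what I have in mind for a non-binary
name score. It is only a sketch: struct rename_candidate, pick_best()
and the "slack" parameter are all made up for illustration and do not
correspond to the existing rename code.

```c
#include <stddef.h>

/* Hypothetical candidate record; both scores are "higher is better". */
struct rename_candidate {
	const char *path;
	int content_score;
	int name_score;
};

/*
 * Among the candidates whose content score is within "slack" of the
 * best content score, return the index of the one with the highest
 * name score (ties go to the higher content score, then to the
 * earlier entry). Returns -1 if there are no candidates.
 */
static int pick_best(const struct rename_candidate *c, size_t n, int slack)
{
	size_t i;
	int best = -1, top = 0;

	for (i = 0; i < n; i++)
		if (i == 0 || c[i].content_score > top)
			top = c[i].content_score;

	for (i = 0; i < n; i++) {
		if (c[i].content_score + slack < top)
			continue;	/* content too dissimilar to compete */
		if (best < 0 ||
		    c[i].name_score > c[best].name_score ||
		    (c[i].name_score == c[best].name_score &&
		     c[i].content_score > c[best].content_score))
			best = (int)i;
	}
	return best;
}
```

With a slack of zero this degenerates to "take the best content score",
so the name only ever acts as a tie-breaker among near-equal contents.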
