Hi Junio,

On Tue, 8 Oct 2019, Junio C Hamano wrote:

> Johannes Schindelin <johannes.schinde...@gmx.de> writes:
>
> >> I didn't quite understand this part, though.
> >>
> >>     The default creation factor is 60 (roughly speaking, it wants 60% of
> >>     the lines to match between two patches, otherwise it considers the
> >>     patches to be unrelated).
> >>
> >> Would the updated creation factor used which is 95 (roughly
> >> speaking) want 95% of the lines to match between two patches?
> >>
> >> That would make the matching logic even pickier and reject more
> >> pairing, so I must be reading the statement wrong X-<.
> >
> > No, I must have written the opposite of what I tried to say, is all.
>
> So, cfactor of 60 means at most 60% is allowed to differ and the
> two patches are still considered to be related, while 95 means only
> 5% needs to be common?  That would make more sense to me.

Okay, I not only wrote the opposite of what I wanted to say, I also
misremembered.

When `range-diff` tries to determine matching pairs of patches, it
builds an `(m+n)x(m+n)` cost matrix, where `m` is the number of patches
in the first commit range and `n` is the number of patches in the second
one.

Why not `m x n`? Well, that's the obvious matrix, and that's what it
starts with, essentially assigning the number of lines of the diff
between the diffs as "cost".

But then `git range-diff` extends the cost matrix to allow for _all_ of
the `m` patches to be considered deleted, and _all_ of the `n` patches
to be added. As cost, it cannot use a "diff of diffs" because there is
no second diff. So it uses the number of lines of the one diff it has,
multiplied by the creation factor interpreted as a percentage.
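
To make the shape of that matrix concrete, here is a minimal sketch in
Python. It is not the actual C code; the function names, the integer
rounding, the zero cost for pairing two placeholders, and the brute-force
solver (a stand-in for a real linear-assignment algorithm) are all
assumptions of the sketch, and the numbers are made up.

```python
from itertools import permutations


def build_cost_matrix(pair_cost, size_a, size_b, creation_factor=60):
    """Build the (m+n)x(m+n) matrix described above.

    pair_cost[i][j]: lines in the diff between old patch i and new patch j.
    size_a[i], size_b[j]: lines in each patch's own diff.
    """
    m, n = len(size_a), len(size_b)
    total = m + n
    cost = [[0] * total for _ in range(total)]
    for i in range(m):
        for j in range(n):
            cost[i][j] = pair_cost[i][j]
        # Extra columns: old patch i has no counterpart (it was dropped).
        for j in range(n, total):
            cost[i][j] = size_a[i] * creation_factor // 100
    for j in range(n):
        # Extra rows: new patch j has no counterpart (it was created).
        for i in range(m, total):
            cost[i][j] = size_b[j] * creation_factor // 100
    # The bottom-right block stays 0 (assumption: pairing two
    # placeholders costs nothing).
    return cost


def cheapest_assignment(cost):
    """Try every row-to-column assignment; only viable for tiny inputs."""
    size = len(cost)
    return min(permutations(range(size)),
               key=lambda cols: sum(cost[i][cols[i]] for i in range(size)))


# Two old patches, one new patch (hypothetical sizes).
example = build_cost_matrix([[5], [40]], size_a=[20, 30], size_b=[22])
assignment = cheapest_assignment(example)
```

In this example, a column index below `n` (here 1) in `assignment[i]`
means old patch `i` was paired with that new patch; anything else means
it was treated as dropped.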

The naive creation factor would be 100%, which is (almost) as if we
assumed an empty diff for the missing diff. But that would make the
range-diff too eager to dismiss rewrites, as experience obviously showed
(not my experience, but Thomas Rast's, who came up with `tbdiff` after
all): the diff of diffs includes a diff header, for example.

The interpretation I offered (although I inverted what I wanted to say)
is similar in spirit to that metric (which is not actually a metric, I
believe, because I expect it to violate the triangle inequality), but it
is obviously inaccurate: the number of lines of the diff of diffs does
not say anything about the number of matching lines; quite to the
contrary, it correlates somewhat with the number of non-matching lines.

So a better interpretation would have been:

        The default creation factor is 60 (roughly speaking, it wants at
        most 60% of the diffs' lines to differ, otherwise it considers
        them not to be a match).

This is still inaccurate, but at least it gets the idea of the
range-diff across.
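
Reduced to one old and one new patch, that rule of thumb can be sketched
as a comparison of two totals. This is a simplification under my own
assumptions (the name `is_match`, the integer rounding, and ties going
to "unrelated" are hypothetical), not what the C code does verbatim:

```python
def is_match(diff_of_diffs, size_old, size_new, creation_factor=60):
    """Decide whether pairing two patches is cheaper than dropping one
    and creating the other.

    diff_of_diffs: lines in the diff between the two patches' diffs.
    size_old, size_new: lines in each patch's own diff.
    """
    cost_paired = diff_of_diffs
    cost_unpaired = (size_old * creation_factor // 100
                     + size_new * creation_factor // 100)
    return cost_paired < cost_unpaired


# Two 50-line patches whose diff of diffs is 70 lines (made-up numbers).
with_default = is_match(70, 50, 50, creation_factor=60)  # 70 < 60 is false
with_95 = is_match(70, 50, 50, creation_factor=95)       # 70 < 94 is true
```

So raising the creation factor makes the pairing more lenient, not
pickier.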

Of course, I will never be able to amend the commit message in
GitGitGadget anyway, as I have merged that PR already.

Ciao,
Dscho
