Re: [PATCH 2/2] xdiff: implement empty line chunk heuristic

Stefan Beller Mon, 18 Apr 2016 23:48:39 -0700

On Mon, Apr 18, 2016 at 10:03 PM, Jeff King <p...@peff.net> wrote:
> On Mon, Apr 18, 2016 at 02:12:30PM -0700, Stefan Beller wrote:
>
>> +
>> +             /*
>> +              * If a group can be moved back and forth, see if there is an
>> +              * blank line in the moving space. If there is a blank line,
>> +              * make sure the last blank line is the end of the group.
>
> s/an/a/ on the first line


So it looks like I'll be sending another version of this series tomorrow.
Thanks for pointing this out!

>
>> +              * As we shifted the group forward as far as possible, we only
>> +              * need to shift it back if at all.
>
> Maybe because I'm reading it as a diff that only contains this hunk and
> not the whole rest of the function, but the "we" here confused me. You
> mean the earlier, existing loop in xdl_change_compact, right?
>
> Maybe something like:
>
>   As we already shifted the group forward as far as possible in the
>   earlier loop...
>
> would help.

I'll try to get rid of the 'we'; otherwise I'll go with your suggestion.

>
>> +             if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>> +                     while (ixs > 0 &&
>> +                            !is_blank_line(recs, ix - 1, flags) &&
>> +                            recs_match(recs, ixs - 1, ix - 1, flags)) {
>> +                             rchg[--ixs] = 1;
>> +                             rchg[--ix] = 0;
>> +                     }
>> +             }
>
> This turned out to be delightfully simple (especially compared to the
> perl monstrosity).
>
> I tried comparing the output to the perl one, but it's not quite the
> same. In that one we had to work with the existing hunks and context
> lines, so any hunk that got shifted ended up with extra context on one
> side, and too little on the other. But here, we can actually bump the
> context lines to give the correct amount on both sides, which is good.
>
> I guess this will invalidate old patch-ids, but there's not much to be
> done about that.
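
To make the loop quoted above easier to follow outside of xdiff, here is a
minimal stand-alone sketch of the same idea. It is not the xdiff code: the
names is_blank() and shift_back_to_blank() are made up for illustration, lines
are plain C strings instead of xdiff records, and "blank" is assumed to mean
"only whitespace". The caller is assumed to have already shifted the group
forward as far as possible and to have seen a blank line in the moving space.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: a line counts as blank if it holds only whitespace. */
static int is_blank(const char *line)
{
	return line[strspn(line, " \t\r\n")] == '\0';
}

/*
 * Slide the changed group [*start, *end) back toward the top of the
 * file, one line at a time, until its last line is blank or the group
 * cannot slide any further. Sliding back by one is only possible when
 * the line just above the group equals the group's last line.
 */
static void shift_back_to_blank(const char **lines, char *changed,
				size_t *start, size_t *end)
{
	while (*start > 0 &&
	       !is_blank(lines[*end - 1]) &&
	       !strcmp(lines[*start - 1], lines[*end - 1])) {
		changed[--*start] = 1;	/* line above joins the group */
		changed[--*end] = 0;	/* old last line leaves the group */
	}
}
```

With the lines "{", "}", "", "{", "}", "", "{", "}" and the group initially
covering the last three lines, the loop moves the group up twice, so that it
covers "{", "}", "" and ends on the blank line.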

For the record:
I thought about "optimal hunk separation" for a while, especially during my
bike commute. And while this heuristic seems to be a good fit for most of
the cases inspected, we can do better (in the future).

I am convinced the better way to do it is like this:

    Calculate the entropy for each line and take the last line with the
    lowest entropy as the last line of the hunk.

That heuristic requires more computation though, as computing the entropy of
a line is more involved. To do that, I would imagine we'd need to loop over
the whole file, count the occurrences of each char (byte), and then
take the negative log of (number of that byte / number of bytes in file) [1].

This would model our actual goal a bit more closely: to split at places where
there is low information density (which is what entropy measures).
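
A rough sketch of what I have in mind, untested against real diffs and not
part of this series. All names here (count_bytes, floor_log2, line_score,
last_lowest) are hypothetical. Two assumptions are mine: the score of a line
is the sum of the per-byte costs, and -log2(count / total) is approximated in
whole bits as floor(log2(total)) - floor(log2(count)) to stay in integer
arithmetic.

```c
#include <stddef.h>
#include <string.h>

/* Count how often each byte value occurs in the whole file. */
static void count_bytes(const char *buf, size_t len, size_t counts[256])
{
	size_t i;

	memset(counts, 0, 256 * sizeof(counts[0]));
	for (i = 0; i < len; i++)
		counts[(unsigned char)buf[i]]++;
}

/* floor(log2(v)) for v > 0, computed with shifts only. */
static unsigned int floor_log2(size_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Hypothetical line score: for every byte in the line, add a whole-bit
 * approximation of -log2(counts[byte] / total). Rare bytes cost more,
 * so lines made of very common bytes get a low score.
 */
static unsigned long line_score(const char *line, size_t len,
				const size_t counts[256], size_t total)
{
	unsigned long bits = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		size_t c = counts[(unsigned char)line[i]];

		if (c)	/* bytes that never occur in the file are skipped */
			bits += floor_log2(total) - floor_log2(c);
	}
	return bits;
}

/* Index of the last entry with the lowest score; n must be > 0. */
static size_t last_lowest(const unsigned long *score, size_t n)
{
	size_t i, best = 0;

	for (i = 1; i < n; i++)
		if (score[i] <= score[best])
			best = i;
	return best;
}
```

The idea would be to score each line in the range the group can slide over and
end the hunk at the index last_lowest() returns. Whether summing per byte is
the right way to score a line (as opposed to, say, averaging) is an open
question.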

One example Jacob pointed out was a thing like

/**
 * Comment here. Over
 * more lines.
 *
+ *  Add line here with a blank line
+ *
+ * in between and a trailing blank after.
+ *
 */

I think we had cases like this in the kernel tree and elsewhere,
and for a human it is clear to break after the last "empty line"
(which for comments starts with " * "). To detect those we can use
the entropy, as such a line doesn't convey much information.
(git show e1f7037167323461c0415447676262dcb)

It also keeps the false positives out. Jacob pointed at
85ed2f32064b82e541fc7dcf2b0049a05 IIRC, which came out badly when
looking at the shortest lines only, but I'd imagine the entropy-based
heuristic will do better there.

[1] https://en.wikipedia.org/wiki/Entropy_(information_theory)

Thanks for the review,
Stefan

>
> -Peff
