On Tue, Jun 14, 2011 at 5:33 PM, Stefan Sperling <s...@elego.de> wrote: > On Tue, Jun 14, 2011 at 05:21:27PM +0200, Neels J Hofmeyr wrote: >> Hi Johan, >> >> it's been a while and I still haven't sent you my diff wish we briefly >> touched on the Subversion hackathon.
Hi Neels, thanks for pursuing this further. >> Here is a fabricated example of why I don't like diff to match empty lines: > >> A couple of lines get replaced by completely different ones. By matching the >> blank line in the middle, it becomes far less readable, IMHO. In my fantasy >> dream world, this diff would print: >> >> [[[ >> Index: x >> =================================================================== >> --- x (revision 1) >> +++ x (working copy) >> @@ -4,11 +4,13 @@ >> >> void aaa() >> { >> - if (x) >> - do(things); >> - >> - if (y) >> - do(stuff); >> + while (x || y) >> + { >> + check(something); >> + notify(stuff); >> + >> + try(somethingelse); >> + } >> >> bb(b); >> } >> ]]] Yeah, that's certainly a nicer diff for human consumption :-). But strictly speaking it's a larger diff (more lines marked as +/-), so that makes it less optimal for the current algorithm. The "minimality" criterion of diff (with the LCS) makes it easy to reason about, and makes for a nice and clear mathematical definition of the requested diff result. But I agree that it doesn't necessarily lead to "good-quality" diffs for human readers. So: good-quality != minimal, but it's more of a "soft" criterion, depends on the language, context, ... what lines are important to the user, ... Introducing heuristics in one form or another is probably the only way to improve this. I'm not an expert in this area myself (I'm actually more of a theoretical mathematician, so I'm naturally skeptical of anything without a formal proof :-)). But I have also read some good things about patience diff, like Stefan suggested ... > Do you know about patience diff? > http://bramcohen.livejournal.com/73318.html > I think we should try teaching this algorithm to svn diff at some point. > It's a lot more generic than just checking for empty lines and should > yield the results you want. Or if Morten has some great inspiration along similar lines, that might be equally good or better... On Tue, Jun 14, 2011 at 7:53 PM, Morten Kloster <mor...@gmail.com> wrote: > Actually, I was already planning on making the LCS algorithm estimate > how statistically significant each matching section is (just a simple > heuristic, of course, nothing mathematically exact) - I need this for the > proposals in my recent thread "Improvements to diff3 (merge) > performance" - and the standard diff could then take an option -noflukes > (for instance) which would only keep the significant matches. That > should eliminate (or at least reduce) both problems with blank lines and > similar issues with braces being matched in unrelated code. > > Estimating the significance should be quite quick, so no worries there. > --- > Morten Intuitively, I'd say: let's look into patience diff (or a variation thereof), because it's already being used in several (D)VCS'es, so it has already had a lot of exposure. But that's not really a strong argument :-). Maybe another approach is easier to implement in libsvn_diff, and yields equally good or even better results ... I don't know. One thing I'm not sure about: suppose we have a really good "heuristic diff", should we then make it the default (and make --minimal an option), or should we make it optional (--nice)? I guess we'll see whenever we get at the point where a heuristic diff is implemented :-). Note that GNU diff uses some heuristics by default (and --minimal is an option). Maybe GNU diff would already give a "nice" result with your example with its built-in heuristics? -- Johan