Re: conflict algorithm

Greg A. Woods Mon, 07 Jan 2002 13:45:45 -0800

[ On Monday, January 7, 2002 at 14:35:20 (-0500), Eric Siegerman wrote: ]
> Subject: Re: conflict algorithm
>
> On Mon, Jan 07, 2002 at 06:48:20PM +0000, Duncan Sommerville wrote:
> > Whats the definition of a conflict?
> > 
> > In particular, what's the *scope* of its search - is it per line, per couple of 
>lines, etc?
> 
> I don't believe that's formally defined; in practice, it's "per few lines".


Ah!  A topic close to my heart!  :-)

The algorithm used by 'diff' (and thus 'diff3', thus implicitly defining
the meaning of a "conflict" in CVS) has been formally defined many times
and in several different ways!  :-)

This note from the footnotes of:
<URL:http://www.people.fas.harvard.edu/~lib113/reference/unix/unix9.txt>

    Unix diff itself is by Doug McIlroy, and is based on an              
    algorithm invented independently by Harold Stone and by Wayne Hunt         
    and Tom Szymanski. (See "A fast algorithm for computing longest            
    common subsequences," by J. W. Hunt and T. G. Szymanski, CACM, May,        
    1977.) The diff algorithm is described in M. D. McIlroy and J.W.           
    Hunt, Technical Report 41, 1976. To quote McIlroy, "I had tried at         
    least three completely different algorithms before the final one.          
    diff is a quintessential case of not settling for mere competency in       
    a program but revising until it was right."                   

In the Bell Laboratories UNIX 7th Edition source we see this claim:

*       Uses an algorithm due to Harold Stone, which finds
*       a pair of longest identical subsequences in the two
*       files.

and the history of the choice of algorithm can be found in here:
<URL:http://attila.stevens-tech.edu/~lbernste/njcse.html>

   The Tale of "diff":  a Caution for Developers of Reusable Modules.
   (with the permission of Doug McIlroy, inventor of diff)

   Once upon a time, there was a mathematical problem of finding the
   longest subsequence of lines common to two files.  "No sweat,"
   thought the developer.  A dynamic programming technique that takes
   time `mn' and space `mn' to compare an `m'-line file to an `n'-line
   file would do the trick.  But space `mn' was unacceptable on the
   small machines of yesteryear.  "OK, we'll fly seat of the pants,"
   thought our hero.  So he read through both files until he found a
   line that disagreed, and then figured he would somehow search back
   and forth in both until he got back in sync.  'Somehow' was the
   killer.  Suppose the second line in one file agreed with the fourth
   line ahead in the other and vice versa.  How to choose?

   Then news came from afar in Princeton that the Wizard Hirschberger
   had seen a way to reduce space `mn' by a mathematical method to space
   `m', while only doubling the time.  "Good deal!" thought our guy.
   "Now we can afford to run it.  It was slow, but it did work and gave
   an explainable 'right' answer in a clearly defined way."

   But the people complained.  When they moved a paragraph, it showed up
   as two changes, a deletion here and an addition there.  So our hero
   made a "diff" that found moves.  It was again seat of the pants, but
   it ran pretty well.  Yet, sometimes, an evil occurred.  If the people
   ran it on stuff where the same line occurred in many places, like
   assembly language or text processing, it discovered lots of deletions
   and additions that could be explained as moves.  Our hero was filled
   with consternation.  Then along came a shining knight, Harold Stone,
   with a dynamic programming technique that reduced the running time
   from the product to the sum of the file lengths, except in unnatural
   cases.  Now here was something fast enough to use on big files,
   efficient in space and time, mathematically justifiable as giving a
   good answer, and experimentally shown to be physiologically useful.

   But then the people tinkered.  Three times they altered output.  They
   added features.  They added stars!  And the tinkering caused the code
   to increase and the manual to swell to half again its size.  "Well,"
   said our guy.  "It is important to know when to stop."

I don't know if the "Stone" algorithm was ever formally published though.

People have since complained that the "Stone" algorithm is very accurate
but not very fast and uses some fixed amount of memory for each
difference found so more research was done.

>From the GNU Diff manual:

    The basic algorithm is described in "An O(ND) Difference Algorithm
    and its Variations", Eugene W. Myers, `Algorithmica' Vol. 1 No. 2,
    1986, pp. 251-266; and in "A File Comparison Program", Webb Miller
    and Eugene W. Myers, `Software--Practice and Experience' Vol. 15
    No. 11, 1985, pp. 1025-1040.  The algorithm was independently
    discovered as described in "Algorithms for Approximate String
    Matching", E. Ukkonen, `Information and Control' Vol. 64, 1985,
    pp. 100-118.


More detail of the implementation currently used by CVS is provided in
the GNU Diff manual in the section titled "What Comparison Means".
Unfortunately the Texinfo manual is not included in the CVS distribution
-- you'll have to obtain it separately in the GNU Diffutils distribution
(which you'll have on most any modern *BSD or Linux box already, just
type "info diff" and read all about it!).

> As I understand it, the algorithm is basically:
>     diff -u common-ancestor new-version-1
>     diff -u common-ancestor new-version-2
>
>     munge all the results together; call it a conflict if a difference
>     block from one "diff" overlaps a block from the other
>
> So basically it depends on the nature of the changes, and specifically on how
> the diff algorithm chooses to report them.

Actually it's a wee bit simpler than that (with the magic burried inside
of the 'diff3' program):

When CVS says:

        Merging differences between 1.X and 1.Y into somefile

it does so using the command:

        diff3 -ATam somefile somefile:1.X somefile:1.Y

(somefile:1.X and somefile:1.Y are temporary files created by checking
out the explicitly specified revisions)

Well, actually only my version of CVS uses '-AT'.  The original CVS uses '-E'

My version shows all the conflicting lines from all the three files, not
just the latter two, and I find this almost infinitely easier to deal
with when manually resolving the conflicts.

<<<<<<< somefile
changed line[s] unique to local copy of somefile
changed line[s] unique to local copy of somefile
changed line[s] unique to local copy of somefile
||||||| 1.X
matching line[s] unique to version 1.X of somefile
matching line[s] unique to version 1.X of somefile
matching line[s] unique to version 1.X of somefile
=======
matching line[s] unique to version 1.Y of somefile
matching line[s] unique to version 1.Y of somefile
matching line[s] unique to version 1.Y of somefile
>>>>>>> 1.Y



Here's an edited version of the manual page describing only just those
options my version of CVS uses to do the merge:


NAME
       diff3 - find differences between three files

SYNOPSIS
       diff3 [options] mine older yours

DESCRIPTION
       The   diff3  command  compares  three  files  and  outputs
       descriptions of their differences.

       The files to compare are mine, older, and yours.  At  most
       one  of these three file names may be -, which tells diff3
       to read the standard input for that file.

   Options

       -A     Incorporate all changes from older  to  yours  into
              mine, surrounding all conflicts with bracket lines.

       -T     Output a tab rather than two spaces before the text
              of a line in normal format.  This causes the align-
              ment of tabs in the line to look normal.

       -a     Treat  all  files as text and compare them line-by-
              line, even if they do not appear to be text.

       -m     Apply  the  edit  script to the first file and send
              the result to standard output.  Unlike  piping  the
              output from diff3 to ed, this works even for binary
              files and incomplete lines.  -A is  assumed  if  no
              edit script option is specified.


If anyone's interested I have changes to the "sanity.sh" tests to
accomodate my change to using "diff3 -A" and I can make them available
upon request.  Currently my version of sanity.sh is based on rev# 1.716

--
                                                                Greg A. Woods

+1 416 218-0098;  <[EMAIL PROTECTED]>;  <[EMAIL PROTECTED]>;  <[EMAIL PROTECTED]>
Planix, Inc. <[EMAIL PROTECTED]>; VE3TCP; Secrets of the Weird <[EMAIL PROTECTED]>

_______________________________________________
Info-cvs mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/info-cvs

Re: conflict algorithm

Reply via email to