Research on rename-aware merging -- or rather of ordered (XML) trees

Julian Foad Wed, 12 Sep 2012 07:43:51 -0700

Hi, merge fans!

I've started doing some research for how we can update 'merge' to fully support 
renames.


Surprisingly to me, I have so far found very little written about merging trees 
of files.  If anyone knows how to find some, I'd be glad.  On the other hand, I 
found several papers on 3-way merging XML data, which is somewhat analogous in 
that it is a tree structure of nodes that have both content and children.

This email is to summarize some useful points I have taken from a paper called 
"A 3-way Merging Algorithm for Synchronizing Ordered Trees -- the 3DM merging 
and differencing tool for XML" [1], which I'll refer to as "3DM" for short.  
This is all at the algorithmic level, far removed from practical implementation 
concerns.

* First *match* the three versions of each node (in the base, branch-1, and 
branch-2 trees); then *merge* the matched sets.  That is, treat matching and 
merging as two distinct problems.

* Matching means deciding which node in the target tree corresponds to which 
node in the base tree, and which node in the source tree corresponds to which 
node in the base tree, and thus which source node corresponds to which target 
node.  Matching is all about following moves and copies.  Matching can be 
deterministic (using node-id information) or heuristic (using similarity 
detection).  Matching is explicit: it builds a list of "edges" connecting these 
nodes, and that list is then used as an input to the merging stage.  (That 
doesn't necessarily mean it has to be completed before the merging stage can 
begin.)

  -- In Subversion's current merge code the matching is implicit (and 
simple): it assumes that a relative path 'foo/bar' inside the root of 
the branch-1 tree matches relative path 'foo/bar' inside the root of the
base tree and of the branch-2 tree and of the merged result tree.

* One interesting thing about matching is whether to match copies: if branch-1 
copies 'foo' to 'bar', then do we match 'foo' in the base tree to both 'foo' 
and 'bar' in branch-1?  That depends on whether we want to propagate the 
modification to both copies when a node is copied in branch 1 and modified in 
branch 2.  The 3DM paper takes the view that one node should be declared the 
"original" and receive all changes, and any other nodes that look like copies 
of it should be flagged as 'partial matches' and receive *some* changes.  I 
don't fully understand that scheme yet, but the idea seems to come from the 
assumption that the matching is heuristic, based on similarity, and a desire to 
propagate content changes and/or tree changes to all copies, but only under 
certain conditions.

  -- In Subversion, we will likely prefer deterministic matching, perhaps with 
a fallback to similarity matching if we can't get node-id information from an 
old server.  If we decide (and let's not be too hasty to make this decision) 
that we 
should never merge changes into a copy but only into the "original" 
node, then the matchings need only be 1:1.  Otherwise, we will need to make our 
own plan for whether we match copies and, if so, to what degree we propagate 
changes to them. [5]

* In 3DM, an XML tree node is either an *element* or *text*:

  <p>          -- n1: elem children={n2,n3,n5}
    Hello,     -- n2: text
    <emph>     -- n3: elem children={n4}
      my       -- n4: text
    </emph>    -- end tags are implicit, not nodes
    world!     -- n5: text
  </p>

An element node has a *content* (its XML attributes) and a *child list* (a list 
of element nodes and text nodes); a text node is a leaf node and has *content* 
only (the text).

  -- In Subversion, by analogy, a directory has a child list and its content is 
its set of properties; files (and symlinks) are leaf nodes.

* In what order do we process the nodes, given that there are multiple trees 
whose nodes may be in different orders?  Answer: Starting at the root node, 
merge one node's *content*, and (separately) its *child list*.  Then recurse on 
each child of the resulting merged node.  That is, a top-down, breadth-first 
traversal of the merged result tree.

* A child list in 3DM is an *ordered* list of non-unique nodes (they are not 
assumed to have unique ids), e.g.

  <foo>
    <p>hello</p>
    <p>goodbye</p>
    <p>hello</p>
  </foo>

and much of the design of 3DM, and its sets of merge primitives and kinds of 
conflict, are related to traacking and preserving node-ordering relationships.

  -- In Subversion, a child list is an *unordered* list, and unique (by name) 
within a directory.  Thus we will have a very different set of merge primitives 
and of conflicts.  This difference is almost enough to make the XML merging 
problem seem irrelevant, but I am finding there is still enough common ground 
to be useful.

* The primitives of *change* in 3DM are:

  Update (we say 'modify') -- a change of content
  Insert (we say 'add')
  Delete
  Move
  Copy

  -- These make sense for Subversion too.

* 3DM describes a set of "functional requirements", while I'll paraphrase and 
comment on:

  1. Easy to understand.  -- No quibble there.

  2a. A change on one side is always propagated to the result.  -- Agreed.

  2b. There is no restriction on the positions of the source and destination of 
a move or copy.  (For example, a restriction could be that a copy is only 
recognized if the source and destination are in the same directory.)  -- Agreed.

  3. Changes are interpreted relative to their position in the tree, not 
applied at their absolute position in the tree.  -- Agreed.

  4. A node exists in the *context* of its siblings and parent; this context 
should be preserved in the result.  -- The 'parent' part of the context pretty 
much follows from (3), while the 'siblings' part of the context is closely 
related to ordering and so may hardly be relevant in Subversion.  Nevertheless, 
this idea bears further consideration.

  5. In a copy-vs-modify situation, 3DM chooses that the modification should be 
applied to all the copies, yet states that this isn't always wanted.  -- In 
Subversion, I don't think we want to do this; we might want it to be optional, 
controlled by a flag in a 'merge policy' or by some sort of callback.

  6. Inserting a node at the same place on each branch: the result could be 
"insert both" or a conflict.  -- The Subversion equivalent is insertions of the 
same name, which we would normally expect to be a conflict, but we might want 
the option to automatically resolve identical duplicate adds.

  7. The merge is symmetric: branch 1 and branch 2 can be swapped and the 
result will be the same.  -- In Subversion, yes, that's desirable too. [4]

  8a. Update-vs-update is a conflict.  -- Yes.  (May want an option to 
automatically resolve identical duplicate updates.)

  8b. Del-vs-move is a conflict.  Del-vs-update is a conflict.  Del-vs-copy is 
a conflict.  -- The first two are obvious.  The third one is more interesting, 
and something we haven't (to my knowledge) considered before.  I agree it 
should be a conflict.

  8c. Move-vs-move is a conflict.  -- Agreed.


The same author wrote a later (and much shorter) paper, "A Three-way Merge for 
XML Documents" [2], in which a "more elegant variation of the algorithm 
discards support for copy operations, at the same time gaining enormously in 
clarity and
simplicity" [3].  I'll summarize it briefly.

That paper concentrates on merging (not on matching) trees of XML nodes, using 
as its central principle one of the main ideas from the first paper: the idea 
of preserving 'parent-child-successor' (PCS) contexts.  If a node is changed, 
then the context of being a child of its parent node, and next to its left 
sibling (or start of list) and next to its right sibling (or end of list) must 
be preserved in the result tree.  If a node has moved, then both the PCS 
context of the 'gap' it left in its source position (where it was in the base 
tree) and the new PCS contexts it creates (to its left and right) in the its 
target position must be preserved in the result tree.  The merged result tree 
must contain all of the (p-c-s) contexts associated with changes in branch-1 
and all those associated with branch-2 as well.  If the context of a change in 
branch-1 is not compatible with the context a change in branch-2, there is a 
conflict.

I can't see any useful analogy of P-C-S contexts in Subversion, except for the 
'Parent' part of it, because it's entirely geared towards preserving ordering 
in child lists.  In an abstract way I can see how it could be useful to loosely 
associate a node with all its siblings, so that for example if branch-1 moves 
all the children of directory 'notes/' into directory 'doc/', and branch-2 adds 
one new child 'notes/foo', then a suggested result could be that 'foo' should 
be either added in 'doc/' or flagged as a conflict, because all of its sibling 
context has gone away.  But that's rather vague and I'm confident that we don't 
need any such thing.

Feel free to query or comment on any of these ideas: I'd be glad to have some 
discussion.

- Julian

[1] Lindholm, T., "A 3-way Merging Algorithm for Synchronizing Ordered Trees -- 
the 3DM merging and differencing tool for XML." (2001). 
<http://www.cs.hut.fi/~ctl/3dm/thesis.pdf>

[2] Lindholm, T., "AThree-way Merge for XML Documents" (2004). 
<http://www.hiit.fi/files/fi/fc/papers/doceng04-pc.pdf>.

[3] Web page: 'The "3DM" XML 3-way Merging and Differencing Tool', 
<http://tdm.berlios.de/3dm/doc/index.html>.

[4] There may be some edge cases where the simplest implementation would
allow the order of the two branches to leak through.  For example, if 
we chose to resolve non-identical duplicate adds of a node 'foo' by 
adding both nodes, giving one of them a new name, we might name the node
from branch-1 'foo-1' and the node from branch-2 'foo-2'.  However, it 
would be possible to avoid this leakage by comparing some other property
of the nodes (time stamp, content, ...) rather than which side of the 
merge they came from, to decide which one is named '-1' and which '-2'.

[5] We may also be able to distinguish a *branch* from a *copy*, and if a file 
is branched rather than just copied (a sub-tree branched within 
the main-tree branch -- ugh) then we need to decide what to do there as 
well.


--
Subversion Live 2012 - 2 Days: training, networking, demos & more! 
http://bit.ly/NgUaRi
Certified & Supported Apache Subversion Downloads: 
http://www.wandisco.com/subversion/download

Research on rename-aware merging -- or rather of ordered (XML) trees

Reply via email to