Author: jcorvel
Date: Thu May 12 21:26:14 2011
New Revision: 1102471

URL: http://svn.apache.org/viewvc?rev=1102471&view=rev
Log:
* notes/diff-optimizations.txt: New file. Inspired by this thread:
   http://svn.haxx.se/dev/archive-2011-02/0490.shtml
   and some further thoughts.

Added:
    subversion/trunk/notes/diff-optimizations.txt   (with props)

Added: subversion/trunk/notes/diff-optimizations.txt
URL: 
http://svn.apache.org/viewvc/subversion/trunk/notes/diff-optimizations.txt?rev=1102471&view=auto
==============================================================================
--- subversion/trunk/notes/diff-optimizations.txt (added)
+++ subversion/trunk/notes/diff-optimizations.txt Thu May 12 21:26:14 2011
@@ -0,0 +1,133 @@
+Introduction
+------------
+
+This file documents some potential optimizations for "svn diff" (or more
+specifically for the diff algorithm in libsvn_diff, which is used in the
+"diff", "blame", "merge" and "update" subcommands on the client side).
+
+There are two broad approaches:
+
+  - Speed up the existing algorithm, while maintaining a "minimal diff" 
+    (i.e. without heuristics).
+    
+  - Introduce a "non-minimal diff" mode, either by adding heuristics in the
+    existing algorithm, or by implementing another algorithm which doesn't
+    guarantee that it will produce the "minimal diff" in all cases.
+
+
+I. Speeding up "minimal" diff (no heuristics)
+---------------------------------------------
+
+1) More low-level optimization of prefix/suffix scanning.
+
+  - Further optimization of the N-ary prefix/suffix scanning.
+  - Add special-case code for N==2 (i.e. "svn diff").
+
+2) Optimize line hashing.
+
+  - Merge hash calculation with EOL scanning.
+  - Reduce function calling overhead, including loop setup & finalization.
+
+3) Reduce overhead of line/hash container.
+
+  - Use a low collision rate / low overhead hash container.
+
+4) Avoid some hashing by exploiting the fact that matching lines often come
+   in series.
+
+  - If the previous line had a match with the other file, first try to
+    directly compare (memcmp) the next line with the successor of the
+    matched line. Only if it doesn't match, calculate the hash to insert
+    it into the container.
+  - This approach probably conflicts with the "Merge hash calculation with 
+    EOL scanning" suggestion.
+
+5) Discarding non-matching lines before running the LCS (Longest Common
+   Subsequence) algorithm.
+
+  - Can make a *huge* impact on files which have a lot of different/unique
+    lines.
+    (for instance, an example script from the users list [1], which
+    generates two files with random lines, 3/4th of which are
+    non-matching: diff goes from several hours to a couple of seconds;
+    another example is a big file of which indentation style was changed
+    from tabs to spaces, and diffing without an ignore-whitespace option).
+
+  - GNU diff does this as well, but strangely, it does it only as part
+    of a heuristic (which also discards other "confusing lines"). If run
+    with --minimal, these lines are not discarded (though, AFAICS, there
+    should be no problem discarding the non-matching lines even when going
+    for a minimal diff).
+
+  - From mailthread [2]:
+    The core algorithm for calculating a (minimal) diff is to first find
+    the Longest Common Subsequence (the longest sequence of lines which
+    appear both in A and in B (not necessarily as contiguous lines, there
+    can be gaps)). The actual diff can be derived from this very quickly.
+
+    But lines which only appear in A, or only in B, can never be part of
+    the LCS, because they are not common lines to begin with.
+
+    So we can just as well calculate the LCS of A' and B' (A and B
+    without their "definitely non-matching" lines). This will also be an
+    LCS of A and B, because there are no common lines added which can make
+    it even longer.
+
+    The only thing I'm not 100% sure of is if it would yield the same LCS
+    (there can be many LCS's, all equally long, hence the different
+    possible diff-representations which are sometimes not nice for human
+    readers). However, gut feeling tells me it will be the same. I can't
+    prove this though, so feel free to come up with a counter example :-).
+    (although having a different LCS would not be a disaster: it would
+    still be a minimal diff, but a different one).
+
+    The practical difficulty is to map line numbers from lines in A and B
+    to their corresponding lines in A' and B', and back again.
+
+
+II. Going for a non-minimal diff (i.e. heuristics)
+--------------------------------------------------
+
+In some cases, heuristics can make a big difference (while not guaranteeing
+that you'll get a minimal diff).
+See also issue #1966 (libsvn_diff needs 'non-minimal-diff' mode) [3].
+
+1) Make prefix/suffix scanning able to skip 1 or a couple of 
+   non-matching lines, if it is able to find strictly more matching lines
+   after that, to keep the prefix/suffix scanning going.
+
+   This will usually work well, but can sometimes lead to missynchronization
+   (see [4]):
+
+     bxxaxxbxx
+      || ||
+     axxbxx
+
+   instead of (longest common subsequence):
+       
+     bxxaxxbxx
+       //////
+      axxbxx 
+
+2) Add some heuristics-based shortcuts in the LCS algorithm.
+      
+3) Implement another diff algorithm, such as "Patience Diff" [5], which is
+   already implemented in several other (D)VCS's. It has the potential to
+   be much faster (reducing the problem to calculating several, much
+   smaller LCS's), and has the added advantage of often producing "nicer"
+   diff output. It is however slightly "heuristical", it doesn't guarantee
+   minimality of the diff.
+
+
+References
+----------
+
+[1] http://svn.haxx.se/users/archive-2011-01/0319.shtml
+[2] http://svn.haxx.se/dev/archive-2011-02/0772.shtml
+[3] http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
+needs 'non-minimal-diff' mode)
+[4] Miller, W., and Myers, E.W. "A File Comparison Program.", Software -
+    Practice & Experience 15 (1985), pp. 1025-1040.
+[5] http://bramcohen.livejournal.com/73318.html
+[6] Mailthread with an earlier version of these notes, and some additional
+    discussion: http://svn.haxx.se/dev/archive-2011-02/0490.shtml
\ No newline at end of file

Propchange: subversion/trunk/notes/diff-optimizations.txt
------------------------------------------------------------------------------
    svn:eol-style = native


Reply via email to