Author: jcorvel
Date: Thu May 12 21:26:14 2011
New Revision: 1102471
URL: http://svn.apache.org/viewvc?rev=1102471&view=rev
Log:
* notes/diff-optimizations.txt: New file. Inspired by this thread:
http://svn.haxx.se/dev/archive-2011-02/0490.shtml
and some further thoughts.
Added:
subversion/trunk/notes/diff-optimizations.txt (with props)
Added: subversion/trunk/notes/diff-optimizations.txt
URL:
http://svn.apache.org/viewvc/subversion/trunk/notes/diff-optimizations.txt?rev=1102471&view=auto
==============================================================================
--- subversion/trunk/notes/diff-optimizations.txt (added)
+++ subversion/trunk/notes/diff-optimizations.txt Thu May 12 21:26:14 2011
@@ -0,0 +1,133 @@
+Introduction
+------------
+
+This file documents some potential optimizations for "svn diff" (or more
+specifically for the diff algorithm in libsvn_diff, which is used in the
+"diff", "blame", "merge" and "update" subcommands on the client side).
+
+There are two broad approaches:
+
+ - Speed up the existing algorithm, while maintaining a "minimal diff"
+ (i.e. without heuristics).
+
+ - Introduce a "non-minimal diff" mode, either by adding heuristics in the
+ existing algorithm, or by implementing another algorithm which doesn't
+ guarantee that it will produce the "minimal diff" in all cases.
+
+
+I. Speeding up "minimal" diff (no heuristics)
+---------------------------------------------
+
+1) More low-level optimization of prefix/suffix scanning.
+
+ - Further optimization of the N-ary prefix/suffix scanning.
+ - Add special-case code for N==2 (i.e. "svn diff").
+
+2) Optimize line hashing.
+
+ - Merge hash calculation with EOL scanning.
+ - Reduce function calling overhead, including loop setup & finalization.
+
+3) Reduce overhead of line/hash container.
+
+ - Use a low collision rate / low overhead hash container.
+
+4) Avoid some hashing by exploiting the fact that matching lines often come
+ in series.
+
+ - If the previous line had a match with the other file, first try to
+ directly compare (memcmp) the next line with the successor of the
+ matched line. Only if it doesn't match, calculate the hash to insert
+ it into the container.
+ - This approach probably conflicts with the "Merge hash calculation with
+ EOL scanning" suggestion.
+
+5) Discarding non-matching lines before running the LCS (Longest Common
+ Subsequence) algorithm.
+
+ - Can make a *huge* impact on files which have a lot of different/unique
+ lines.
+ (for instance, an example script from the users list [1], which
+ generates two files with random lines, 3/4th of which are
+ non-matching: diff goes from several hours to a couple of seconds;
+ another example is a big file of which indentation style was changed
+ from tabs to spaces, and diffing without an ignore-whitespace option).
+
+ - GNU diff does this as well, but strangely, it does it only as part
+ of a heuristic (which also discards other "confusing lines"). If run
+ with --minimal, these lines are not discarded (though, AFAICS, there
+ should be no problem discarding the non-matching lines even when going
+ for a minimal diff).
+
+ - From mailthread [2]:
+ The core algorithm for calculating a (minimal) diff is to first find
+ the Longest Common Subsequence (the longest sequence of lines which
+ appear both in A and in B (not necessarily as contiguous lines, there
+ can be gaps)). The actual diff can be derived from this very quickly.
+
+ But lines which only appear in A, or only in B, can never be part of
+ the LCS, because they are not common lines to begin with.
+
+ So we can just as well calculate the LCS of A' and B' (A and B
+ without their "definitely non-matching" lines). This will also be an
+ LCS of A and B, because there are no common lines added which can make
+ it even longer.
+
+ The only thing I'm not 100% sure of is if it would yield the same LCS
+ (there can be many LCS's, all equally long, hence the different
+ possible diff-representations which are sometimes not nice for human
+ readers). However, gut feeling tells me it will be the same. I can't
+ prove this though, so feel free to come up with a counter example :-).
+ (although having a different LCS would not be a disaster: it would
+ still be a minimal diff, but a different one).
+
+ The practical difficulty is to map line numbers from lines in A and B
+ to their corresponding lines in A' and B', and back again.
+
+
+II. Going for a non-minimal diff (i.e. heuristics)
+--------------------------------------------------
+
+In some cases, heuristics can make a big difference (while not guaranteeing
+that you'll get a minimal diff).
+See also issue #1966 (libsvn_diff needs 'non-minimal-diff' mode) [3].
+
+1) Make prefix/suffix scanning able to skip 1 or a couple of
+ non-matching lines, if it is able to find strictly more matching lines
+ after that, to keep the prefix/suffix scanning going.
+
+ This will usually work well, but can sometimes lead to missynchronization
+ (see [4]):
+
+ bxxaxxbxx
+ || ||
+ axxbxx
+
+ instead of (longest common subsequence):
+
+ bxxaxxbxx
+ //////
+ axxbxx
+
+2) Add some heuristics-based shortcuts in the LCS algorithm.
+
+3) Implement another diff algorithm, such as "Patience Diff" [5], which is
+ already implemented in several other (D)VCS's. It has the potential to
+ be much faster (reducing the problem to calculating several, much
+ smaller LCS's), and has the added advantage of often producing "nicer"
+ diff output. It is however slightly "heuristical", it doesn't guarantee
+ minimality of the diff.
+
+
+References
+----------
+
+[1] http://svn.haxx.se/users/archive-2011-01/0319.shtml
+[2] http://svn.haxx.se/dev/archive-2011-02/0772.shtml
+[3] http://subversion.tigris.org/issues/show_bug.cgi?id=1966 (libsvn_diff
+needs 'non-minimal-diff' mode)
+[4] Miller, W., and Myers, E.W. "A File Comparison Program.", Software -
+ Practice & Experience 15 (1985), pp. 1025-1040.
+[5] http://bramcohen.livejournal.com/73318.html
+[6] Mailthread with an earlier version of these notes, and some additional
+ discussion: http://svn.haxx.se/dev/archive-2011-02/0490.shtml
\ No newline at end of file
Propchange: subversion/trunk/notes/diff-optimizations.txt
------------------------------------------------------------------------------
svn:eol-style = native