On Wed, Jun 02, 2021 at 02:20:12PM +0200, Stefan Sperling wrote:
> The patch below implements an effort limit for Subversion's LCS diff.

Here is an updated version of this patch. I took a closer look at
other diff implementations again and realized that they don't use a
fixed limit, but one which is scaled to the size of the Myers graph,
with a square root applied to keep the effort within bounds as the
files being compared become larger.
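
For a sense of scale, with the shift_sqrt() helper and the minimum of
64 from the patch below: a pair of files with a couple of million lines
each ends up with a budget of roughly 2000 iterations of the outer loop,
while a pair of 1000-line files computes to exactly the minimum of 64.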

Compared to the previous patch, I have added comments and renamed
'max_effort' to 'max_cost' to better align with how the algorithm is
described in existing comments.

We already maintain a variable 'p' which represents the cost of
the algorithm. So instead of maintaining a separate counter I am
now setting an upper bound on 'p'.

With this new patch the limit ends up being 2048 for the files I have.
For regular files in a working copy of SVN trunk, the loop typically
needs less than 10 iterations.
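
To illustrate how the budget behaves, here is a small standalone program
(for illustration only, not part of the patch). It copies the shift_sqrt()
helper from the patch, with apr_off_t swapped for long long so it builds
without APR, and prints the clamped max_cost for a few arbitrary input
sizes. The 3-million case comes out at 2048, matching the limit I see for
my test files, while the small inputs just get min_cost.

#include <stdio.h>

/* Same bit-shift square root approximation as in the patch below. */
static int
shift_sqrt(long long val)
{
  int i;

  for (i = 1; val > 0; val >>= 2)
    i <<= 1;

  return i;
}

int
main(void)
{
  const int min_cost = 64;  /* same minimum as in the patch */
  /* Arbitrary example values of length[0] + length[1] / 2. */
  long long sizes[] = { 300, 1500, 3000000 };
  int j;

  for (j = 0; j < 3; j++)
    {
      int max_cost = shift_sqrt(sizes[j]);

      if (max_cost < min_cost)
        max_cost = min_cost;
      printf("input %lld -> max_cost %d\n", sizes[j], max_cost);
    }

  return 0;
}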

For reference, run-times of 'svn diff' on my test files with various
limits, including no limit:

limit = 1024
    0m06.82s real     0m05.40s user     0m01.41s system
limit = 2048
    0m08.15s real     0m06.55s user     0m01.56s system
limit = 4096
    0m11.10s real     0m09.59s user     0m01.52s system
limit = 8192
    0m18.40s real     0m16.57s user     0m01.57s system
limit = 65536
    7m55.40s real     7m53.81s user     0m01.58s system
Full run without limit:
  163m31.49s real   163m16.77s user     0m02.37s system

Git/libxdiff somehow manages to produce a smaller collection of larger
chunks on the problematic files, rather than the many small chunks with
one giant chunk at the end which we end up producing when we bail out
of the loop.
I have made some attempts to produce a more readable diff result in
case the limit is reached but I haven't been successful. Ideas welcome.
Still, I don't think it makes sense to invest a lot of time into
optimizing output in this case, under the assumption that nobody is
going to be carefully reading affected diffs anyway.

[[[
* subversion/libsvn_diff/lcs.c
  (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
   avoid spinning on the CPU for a long time (30 minutes or more) for
   some input files. Large XML documents were known to trigger this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c        (revision 1890390)
+++ subversion/libsvn_diff/lcs.c        (working copy)
@@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
   return new_lcs;
 }
 
+/* Integer square root approximation */
+static int
+shift_sqrt(apr_off_t val)
+{
+  int i;
 
+  for (i = 1; val > 0; val >>= 2)
+    i <<= 1;
+
+  return i;
+}
+
 svn_diff__lcs_t *
 svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
               svn_diff__position_t *position_list2, /* pointer to tail (ring) */
@@ -247,6 +258,9 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   apr_off_t k;
   apr_off_t p = 0;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
+  int max_cost;
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost = 64;
 
   svn_diff__position_t sentinel_position[2];
 
@@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   fp[d - 1].position[0] = sentinel_position[0].next;
   fp[d - 1].position[1] = &sentinel_position[1];
 
+  /* Limit the effort spent looking for a diff snake. If files have
+   * very few lines in common, the effort spent to find nice diff
+   * snakes is just not worth it, the diff result will still be
+   * essentially minus everything on the left, plus everything on
+   * the right, with a few useless matches here and there.
+   *
+   * Very large files which have a lot of common lines interleaved with
+   * changed lines (such as humongous auto-generated XML documents) could
+   * easily keep us busy here for a very long time, in the order of hours.
+   * In such cases we abort the algorithm before it is finished and use
+   * the most recently computed intermediate result. The diff will not be
+   * pretty but a quick suboptimal result is better than a perfect result
+   * which takes hours to compute.
+   */
+  max_cost = shift_sqrt(length[0] + length[1] / 2);
+  if (max_cost < min_cost)
+    max_cost = min_cost;
+
   p = 0;
   do
     {
@@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
 
       p++;
     }
-  while (fp[0].position[1] != &sentinel_position[1]);
+  while (fp[0].position[1] != &sentinel_position[1] && p < max_cost);
 
   if (suffix_lines)
     lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,
