I've heard of several instances of users complaining that 'svn update' takes an extraordinarily long time to run (on the order of hours).
At elego we have received files which reproduce such behaviour. These files are XML files that are almost 100MB in size. While versioning such data is not the primary use case for Subversion, it should not get in the way of people getting work done, so I would like to get this fixed.

The root of the performance problem is located in Subversion's diff code. The files contain on the order of 1,200,000 lines of XML text. I cannot share the full files, unfortunately. Here is one chunk from a diff between two such files, just to give you an idea of the type of content the diff algorithm is trying to split into chunks. Simply imagine that such content goes on for more than a million lines and you have a mental image of what the file generally looks like:

   <MC-DATA-INSTANCE>
    <SHORT-NAME>CounterDbCounter</SHORT-NAME>
    <CATEGORY>ARRAY</CATEGORY>
-   <ARRAY-SIZE>163</ARRAY-SIZE>
+   <ARRAY-SIZE>193</ARRAY-SIZE>^M
    <SUB-ELEMENTS>
     <MC-DATA-INSTANCE>
      <SHORT-NAME>RamEccDoubleBitError_CounterDbCounter</SHORT-NAME>

Generating a diff between two versions of such files with 'svn diff' takes more than 40 minutes on my machine. I stopped it at that point; exactly how long it would take to run to completion hardly matters. As a reference, Git manages to produce a diff in about 13 seconds:

$ time git diff 01.arxml 02.arxml > /dev/null
    0m13.39s real     0m13.04s user     0m00.36s system

Without my patch, the svn update/diff processes are spinning in the following loop:

[[[
  do
    {
      /* For k < 0, insertions are free */
      for (k = (d < 0 ? d : 0) - p; k < 0; k++)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }

      /* for k > 0, deletions are free */
      for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }

      p++;
    }
  while (fp[0].position[1] != &sentinel_position[1]);
]]]

What this loop is trying to do is find an optimal "diff snake": a partial traversal of the Myers graph which builds up, step by step, towards some optimal traversal of the entire graph.

In the Myers diff algorithm, the problem of finding an optimal graph traversal doesn't have a unique answer. There can be several valid traversals, resulting in differences in the generated diff. An extreme example is a traversal which produces a patch that first removes all lines from file A and then adds all lines from file B (a tiny concrete example is shown further below). Such a diff isn't useful for humans to read, but it is valid and will work when applied with a patch program. Finding this particular traversal is also trivial and very fast :-)

The search for an optimal traversal which results in a human-readable diff can take time, and this is exactly where Subversion is spending too much effort when diffing these large XML files. Other diff implementations limit the effort spent by aborting after a certain number of search iterations and simply using the valid traversal which was most recently discovered. The libxdiff code used by Git does this, as does the diff implementation which Neels Hofmeyr wrote for Game of Trees (see https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary).

The patch below implements such an effort limit for Subversion's LCS diff. Diffing the offending XML files in a Subversion working copy becomes fast enough to be usable:

$ time svn diff > /dev/null
    0m06.82s real     0m05.40s user     0m01.41s system

The resulting diff doesn't look as pretty as Git's diff output, but I don't think this will matter in practice.
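To make the degenerate-but-valid traversal mentioned above concrete, here is a tiny made-up example (the file contents are hypothetical). Suppose file A contains the lines "one", "two", "three" and file B contains "one", "2", "three". A readable traversal produces:

 one
-two
+2
 three

while the degenerate traversal produces:

-one
-two
-three
+one
+2
+three

Both apply cleanly and yield the same result; they only differ in how easy they are for humans to read.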
My only concern is that some corner-case diffs which are currently readable could become less readable. If such cases are found, we could easily address the problem by bumping the limit. Note that any such diffs would still be *valid*. The test suite is passing, which implies that trivial diffs aren't affected by this change. I expect that most, if not all, diffs which people actually want to read will remain unaffected, but I cannot tell with 100% certainty.

Before anyone asks: without evidence that it is really needed, I don't think adding a config option to set this limit is warranted. Other implementations are also using hard-coded limits. We could add an environment variable to allow for experimentation, perhaps (a sketch of that idea follows after the patch). In any case, I would prefer to make such tweaks in follow-up patches and keep this initial fix simple so it can be backported easily.

[[[
* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff
   to avoid spinning on the CPU for a long time (30 minutes or more)
   for some input files. Large XML documents were known to trigger
   this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1890384)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -247,6 +247,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   apr_off_t k;
   apr_off_t p = 0;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
+  int maxeffort = 1024;
 
   svn_diff__position_t sentinel_position[2];
 
@@ -353,7 +354,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
 
       p++;
     }
-  while (fp[0].position[1] != &sentinel_position[1]);
+  while (fp[0].position[1] != &sentinel_position[1] && --maxeffort > 0);
 
   if (suffix_lines)
     lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,
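For the record, here is a minimal sketch of the environment-variable idea mentioned above. It is NOT part of the patch; the variable name SVN_DIFF_MAX_EFFORT is made up, and the fallback matches the hard-coded default from the patch:

[[[
/* Hypothetical follow-up sketch, not part of the patch above: allow
   overriding the effort limit via an environment variable, e.g. for
   experimentation. The name SVN_DIFF_MAX_EFFORT is made up. */
#include <stdlib.h>

static int
get_max_effort(void)
{
  const char *val = getenv("SVN_DIFF_MAX_EFFORT");
  int limit = val ? atoi(val) : 0;

  /* Fall back to the hard-coded default when unset or invalid. */
  return limit > 0 ? limit : 1024;
}
]]]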