I've heard of several instances of users complaining that 'svn update' takes an extraordinarily long time to run (on the order of hours).
At elego we have received files which reproduce such behaviour. These files are XML files that are almost 100MB in size. While versioning such data is not the primary use case for Subversion, it should not get in the way of people getting work done, so I would like to get this fixed.

The root of the performance problem is located in Subversion's diff code. The files contain on the order of 1,200,000 lines of XML text. I cannot share the full files, unfortunately. Here is one chunk from a diff between two such files, just to give you an idea of the type of content the diff algorithm is trying to split into chunks. Simply imagine that such content goes on for more than a million lines and you have a mental image of what the file generally looks like:

   <MC-DATA-INSTANCE>
    <SHORT-NAME>CounterDbCounter</SHORT-NAME>
    <CATEGORY>ARRAY</CATEGORY>
-   <ARRAY-SIZE>163</ARRAY-SIZE>
+   <ARRAY-SIZE>193</ARRAY-SIZE>^M
    <SUB-ELEMENTS>
     <MC-DATA-INSTANCE>
      <SHORT-NAME>RamEccDoubleBitError_CounterDbCounter</SHORT-NAME>

Generating a diff between two versions of such files with 'svn diff' takes more than 40 minutes on my machine. I stopped it at that point; exactly how long it would take to run to completion hardly matters. As a reference, Git manages to produce a diff in about 13 seconds:

$ time git diff 01.arxml 02.arxml > /dev/null
    0m13.39s real     0m13.04s user     0m00.36s system

Without my patch, the svn update/diff processes are spinning in the following loop:

[[[
  do
    {
      /* For k < 0, insertions are free */
      for (k = (d < 0 ? d : 0) - p; k < 0; k++)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }

      /* for k > 0, deletions are free */
      for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }

      p++;
    }
  while (fp[0].position[1] != &sentinel_position[1]);
]]]

What this loop is trying to do is find an optimal "diff snake": a partial traversal of the Myers graph which builds up, step by step, towards some optimal traversal of the entire graph.

In the Myers diff algorithm, the problem of finding an optimal graph traversal doesn't have a unique answer. There can be several valid traversals, resulting in differences in the generated diff. An extreme example is a traversal which produces a patch that first removes all lines from file A and then adds all lines from file B (a tiny concrete example is shown further below). Such a diff isn't useful for humans to read, but it is valid and will work when applied with a patch program. Finding this particular traversal is also trivial and very fast :-)

The search for an optimal traversal which results in a human-readable diff can take time, and this is exactly where Subversion is spending too much effort when diffing these large XML files. Other diff implementations limit the effort spent by aborting after a certain number of search iterations and simply using the valid traversal which was most recently discovered. The libxdiff code used by Git does this, as does the diff implementation which Neels Hofmeyr wrote for Game of Trees (see https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary).

The patch below implements such an effort limit for Subversion's LCS diff. Diffing the offending XML files in a Subversion working copy becomes fast enough to be usable:

$ time svn diff > /dev/null
    0m06.82s real     0m05.40s user     0m01.41s system

The resulting diff doesn't look as pretty as Git's diff output, but I don't think this will matter in practice.
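To make the degenerate-but-valid traversal mentioned above concrete, here is a tiny made-up example (the file contents are hypothetical). Suppose file A contains the lines "one", "two", "three" and file B contains "one", "2", "three". A readable traversal produces:

 one
-two
+2
 three

while the degenerate traversal produces:

-one
-two
-three
+one
+2
+three

Both apply cleanly and yield the same result; they only differ in how easy they are for humans to read.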
My only concern is that some corner-case diffs which are currently readable could become less readable. If such cases are found, we could easily address the problem by bumping the limit. Note that any such diffs would still be *valid*. The test suite is passing, which implies that trivial diffs aren't affected by this change. I expect that most, if not all, diffs which people actually want to read will remain unaffected, but I cannot tell with 100% certainty.

Before anyone asks: without evidence that it is really needed, I don't think adding a config option to set this limit is warranted. Other implementations are also using hard-coded limits. We could add an environment variable to allow for experimentation, perhaps (a sketch of that idea follows after the patch). In any case, I would prefer to make such tweaks in follow-up patches and keep this initial fix simple so it can be backported easily.

[[[
* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff
   to avoid spinning on the CPU for a long time (30 minutes or more)
   for some input files. Large XML documents were known to trigger
   this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1890384)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -247,6 +247,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   apr_off_t k;
   apr_off_t p = 0;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
+  int maxeffort = 1024;
 
   svn_diff__position_t sentinel_position[2];
 
@@ -353,7 +354,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
 
       p++;
     }
-  while (fp[0].position[1] != &sentinel_position[1]);
+  while (fp[0].position[1] != &sentinel_position[1] && --maxeffort > 0);
 
   if (suffix_lines)
     lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,
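For the record, here is a minimal sketch of the environment-variable idea mentioned above. It is NOT part of the patch; the variable name SVN_DIFF_MAX_EFFORT is made up, and the fallback matches the hard-coded default from the patch:

[[[
/* Hypothetical follow-up sketch, not part of the patch above: allow
   overriding the effort limit via an environment variable, e.g. for
   experimentation. The name SVN_DIFF_MAX_EFFORT is made up. */
#include <stdlib.h>

static int
get_max_effort(void)
{
  const char *val = getenv("SVN_DIFF_MAX_EFFORT");
  int limit = val ? atoi(val) : 0;

  /* Fall back to the hard-coded default when unset or invalid. */
  return limit > 0 ? limit : 1024;
}
]]]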