On Wed, Jun 02, 2021 at 02:20:12PM +0200, Stefan Sperling wrote: > The patch below implements an effort limit for Subversion's LCS diff.
Here is an updated version of this patch. I took a closer look at other diff implementations again, and realized that they don't use a fixed limit but one which is scaled to the size of the Myers graph. With a square root applied to keep the effort within certain bounds as the files being compared become larger. Compared to the previous patch, I have added comments, and renamed 'max_effort' to 'max_cost' to better align with existing documentation of the algorithm in existing comments. We already maintain a variable 'p' which represents the cost of the algorithm. So instead of maintaining a separate counter I am now setting an upper bound on 'p'. With this new patch the limit ends up being 2048 for the files I have. For regular files in a working copy of SVN trunk, the loop typically needs less than 10 iterations. For reference, run-times of 'svn diff' on my test files with various limits, including no limit: limit = 1024 0m06.82s real 0m05.40s user 0m01.41s system limit = 2048 0m08.15s real 0m06.55s user 0m01.56s system limit = 4096 0m11.10s real 0m09.59s user 0m01.52s system limit = 8192 0m18.40s real 0m16.57s user 0m01.57s system limit = 65536 7m55.40s real 7m53.81s user 0m01.58s system Full run without limit: 163m31.49s real 163m16.77s user 0m02.37s system Git/libxdiff somehow manages to produce a smaller collection of larger chunks on the problematic files, rather than many small chunks with one giant chunk at the end which we end up produing when we bail out of the loop. I have made some attempts to produce a more readable diff result in case the limit is reached but I haven't been successful. Ideas welcome. Still, I don't think it makes sense to invest a lot of time into optimizing output in this case, under the assumption that nobody is going to be carefully reading affected diffs anyway. [[[ * subversion/libsvn_diff/lcs.c (shift_sqrt): New helper function. Borrowed from Game of Trees' diff. (svn_diff__lcs): Limit the effort spent on finding an optimal diff to avoid spinning on the CPU for a long time (30 minutes or more) for some input files. Large XML documents were known to trigger this. ]] Index: subversion/libsvn_diff/lcs.c =================================================================== --- subversion/libsvn_diff/lcs.c (revision 1890390) +++ subversion/libsvn_diff/lcs.c (working copy) @@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines, return new_lcs; } +/* Integer square root approximation */ +static int +shift_sqrt(apr_off_t val) +{ + int i; + for (i = 1; val > 0; val >>= 2) + i <<= 1; + + return i; +} + svn_diff__lcs_t * svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */ svn_diff__position_t *position_list2, /* pointer to tail (ring) */ @@ -247,6 +258,9 @@ svn_diff__lcs(svn_diff__position_t *position_list1 apr_off_t k; apr_off_t p = 0; svn_diff__lcs_t *lcs, *lcs_freelist = NULL; + int max_cost; + /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */ + const int min_cost = 64; svn_diff__position_t sentinel_position[2]; @@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1 fp[d - 1].position[0] = sentinel_position[0].next; fp[d - 1].position[1] = &sentinel_position[1]; + /* Limit the effort spent looking for a diff snake. If files have + * very few lines in common, the effort spent to find nice diff + * snakes is just not worth it, the diff result will still be + * essentially minus everything on the left, plus everything on + * the right, with a few useless matches here and there. + * + * Very large files which have a lot of common lines interleaved with + * changed lines (such as humongous auto-generated XML documents) could + * easily keep us busy here for a very long time, in the order of hours. + * In such cases we abort the algorithm before it is finished and use + * the most recently computed intermediate result. The diff will not be + * pretty but a quick suboptimal result is better than a perfect result + * which takes hours to compute. + */ + max_cost = shift_sqrt(length[0] + length[1] / 2); + if (max_cost < min_cost) + max_cost = min_cost; + p = 0; do { @@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1 p++; } - while (fp[0].position[1] != &sentinel_position[1]); + while (fp[0].position[1] != &sentinel_position[1] && p < max_cost); if (suffix_lines) lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,