On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <mor...@gmail.com> wrote:
>> []
>> Here's an updated patch based on the current HEAD, and with
>> one or two minor improvements. While, as stated, the patch
>> can give many-fold speed increases for ideal cases, it also slows
>> down the LCS algorithm significantly for poorly matching files; I
>> got about 18% increase in running time on my system for files
>> that were very poor matches but had no lines unique to either file.
>
> Hi Morten,
>
> Hmmm, that sounds like a lot of overhead in some cases (I actually
> wonder why -- I haven't fully understood the implementation, but on
> first sight it doesn't seem that computationally intensive). But the
> most important question that's bugging me: do those "ideal cases"
> occur often in real-world examples?
>
It's not surprising to get that kind of speed reduction: the patch adds
an extra test inside the loops that take the bulk of the running time
when the files don't match.

> I'm sure you could craft good examples, but how likely am I to have a
> benefit of this in my average repository (given all of the other
> optimizations taking away already a lot of the work (prefix/suffix
> elimination; elimination of non-matching lines due to token-counting;
> ...))?
>
> Can you give examples from subversion's own repository, or other
> public repositories, where it can have a big impact?
>

I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
or close enough) and 1036419 (semi-randomly chosen to have about the
right fraction of changes relative to 1127198; it was the first one I
looked at, and it seemed reasonable for the test). With the patch, a full
run of svn_diff_file_diff was only about 15% faster, but the LCS
algorithm itself was about 3.5 times faster; most of the time was spent
reading the tokens from the files.

The "problem" is that the files have to be quite big before the LCS
algorithm itself dominates the running time when the files still match
reasonably well, whereas the LCS algorithm dominates much sooner when
the files don't match at all, and that is exactly the case where this
patch degrades performance.

It would be quite easy to do a quick heuristic test for whether this
patch is likely to help, and then dispatch to alternate versions of the
LCS algorithm depending on the outcome of that test (a rough sketch of
what I mean is at the bottom of this mail). That would make the code
somewhat harder to maintain, of course.

> Or are you looking at this for some specific purpose, some repository
> where this situation occurs regularly and has a big impact (huge
> LCS'es ...)?
>

Nope, I just like to optimize stuff, and since we were already looking
at diff optimization, I thought I might as well point out the
possibility.

> --
> Johan
>
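
P.S.: For illustration, here is a rough standalone sketch of the kind of
check-and-dispatch I have in mind. All names, the match estimate and the
threshold are made up for this mail and are not the actual libsvn_diff
internals; a real version would reuse the information the token-counting
pass already has instead of doing the quadratic scan below.

[[[
/* Hypothetical sketch only: names, the threshold and the match estimate
 * are invented for illustration; this is not the libsvn_diff code. */
#include <stdio.h>

/* Stand-in for the current LCS routine (no per-iteration test). */
static void
lcs_classic(const int *a, int na, const int *b, int nb)
{
  (void)a; (void)na; (void)b; (void)nb;
  printf("running classic LCS\n");
}

/* Stand-in for the patched LCS routine, which can skip runs of lines
 * unique to one file but pays an extra test per iteration. */
static void
lcs_unique_skip(const int *a, int na, const int *b, int nb)
{
  (void)a; (void)na; (void)b; (void)nb;
  printf("running unique-skipping LCS\n");
}

/* Crude match estimate: fraction of A's tokens that also occur in B.
 * Quadratic here only to keep the example self-contained; the real
 * token-counting pass already knows which tokens occur in both files. */
static double
estimate_match(const int *a, int na, const int *b, int nb)
{
  int i, j, shared = 0;
  for (i = 0; i < na; i++)
    for (j = 0; j < nb; j++)
      if (a[i] == b[j])
        {
          shared++;
          break;
        }
  return na > 0 ? (double)shared / na : 1.0;
}

/* Pick the LCS variant based on the cheap estimate; 0.5 is an
 * arbitrary illustration threshold, not a measured one. */
static void
lcs_dispatch(const int *a, int na, const int *b, int nb)
{
  if (estimate_match(a, na, b, nb) > 0.5)
    lcs_unique_skip(a, na, b, nb);
  else
    lcs_classic(a, na, b, nb);
}

int
main(void)
{
  const int original[]  = { 1, 2, 3, 4, 5 };
  const int similar[]   = { 1, 2, 9, 4, 5 };
  const int unrelated[] = { 6, 7, 8, 9, 10 };

  lcs_dispatch(original, 5, similar, 5);    /* -> unique-skipping LCS */
  lcs_dispatch(original, 5, unrelated, 5);  /* -> classic LCS */
  return 0;
}
]]]

The point is just that the choice could be made from information we
already have before the LCS loop starts, so the poorly-matching case
would never pay for the extra per-iteration test.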