On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <mor...@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <mor...@gmail.com> wrote:
>>> > []
>>> Here's an updated patch based on the current HEAD, and with
>>> one or two minor improvements. While, as stated, the patch
>>> can give many-fold speed increases for ideal cases, it also slows
>>> down the LCS algorithm significantly for poorly matching files; I
>>> got about 18% increase in running time on my system for files
>>> that were very poor matches but had no lines unique to either file.
>>
>> Hi Morten,
>>
>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>> wonder why -- I haven't fully understood the implementation, but on
>> first sight it doesn't seem that computationally intensive). But the
>> most important question that's bugging me: do those "ideal cases"
>> occur often in real-world examples?
>>
>
> It's not surprising to get that kind of speed reduction, since it uses
> an extra test within the loops that take the bulk of the time when
> the files don't match.
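Just to make sure I picture that worst case correctly: I assume you mean
something like two files built from the same repeated blocks of lines,
but with the blocks in opposite order, so that no line is unique to
either file and yet the LCS is only a small fraction of the file length.
A throwaway generator for that kind of pair could look like the sketch
below (all names are made up; this is not part of the patch):

[[[
/* Generate a "very poor match, no unique lines" pair: both files
 * consist of the same blocks of repeated lines, but in opposite block
 * order, so every line value occurs many times in both files while the
 * LCS is only about 1/blocks of the file length. */
#include <stdio.h>

static int
write_poor_match_pair(const char *path_a, const char *path_b,
                      int blocks, int lines_per_block)
{
  FILE *fa = fopen(path_a, "w");
  FILE *fb = fopen(path_b, "w");
  int b, i;

  if (!fa || !fb)
    {
      if (fa) fclose(fa);
      if (fb) fclose(fb);
      return 1;
    }

  for (b = 0; b < blocks; b++)
    for (i = 0; i < lines_per_block; i++)
      {
        fprintf(fa, "block %d\n", b);               /* 0, 0, ..., 1, 1, ... */
        fprintf(fb, "block %d\n", blocks - 1 - b);  /* reverse block order  */
      }

  fclose(fa);
  fclose(fb);
  return 0;
}
]]]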
Indeed, that might be it :-). Just wondering though: it isn't by any
chance the case that, worst case, clear_fp is called too often (more
often than is useful, each time resetting the LCS run for only marginal
gain)? Can you verify that those 18% are spent because of those extra
comparisons, and not because of calls to clear_fp? IIUC, with those
poorly-matching files, there should never be a call to clear_fp, because
the unique-matching optimization never kicks in, right?

Also, those 18% you measured: is that the increase in svn_diff_file_diff
time, or in lcs time? Or is that approximately the same, because the lcs
time dominates due to the poor matching?

I have two more, possibly stupid, questions on the implementation in the
patch:

- You mentioned in your theoretical explanation that you still have to
  scan the entire d-range for every p, even after you've found a "reset
  point". But can't you just "break" from the for-loops whenever you
  encounter a reset point?

- You calculated 'limit' as:

      limit = (d >= 0 ? 2 * p + d : 2 * p - d);

  and made the "reset condition" be:

      if (fp[k].uniquematchcount > limit + k)   /* with k < 0 */

  or:

      if (fp[k].uniquematchcount > limit - k)   /* with k >= 0 */

  I don't fully understand. Can you explain a bit more why this is the
  right value for 'limit', and how this is the correct "reset condition"?
  (To be sure we're talking about the same thing, I've paraphrased the
  check as a stand-alone snippet further down in this mail.)

>> I'm sure you could craft good examples, but how likely am I to have a
>> benefit of this in my average repository (given all of the other
>> optimizations taking away already a lot of the work (prefix/suffix
>> elimination; elimination of non-matching lines due to token-counting;
>> ...))?
>>
>> Can you give examples from subversion's own repository, or other
>> public repositories, where it can have a big impact?
>>
>
> I ran it on merge.c (in libsvn_client), between revisions 1127198 (HEAD,
> or close enough) and 1036419 (semi-randomly chosen to have about
> the right fraction of changes from 1127198 - it was the first one I looked
> at, and seemed reasonable for the test). When running all of
> svn_diff_file_diff, the patch was about 15% faster; however, the LCS
> algorithm itself was about 3.5 times faster; most of the time was
> spent reading the tokens from file.

That's quite neat! It's a lot more than I expected in an average case.
Maybe some more testing should be done (not saying you have to do this
per se, just in general) to see how average this is :-). But for now it's
already great to see a single real-world example like the one you just
gave.

> The "problem" is that the files have to be quite big before the LCS
> algorithm itself dominates the running time even when the files still
> match reasonably well, whereas it is much easier for the LCS
> algorithm to dominate the running time when the files don't match
> at all, in which case this patch degrades performance.

Yeah, I know :-). The token-scanning/hashing phase is still quite a heavy
factor in the diff performance (especially when the matching of the files
is reasonably good (yielding a fast LCS run), which is typically the case
with changes in source code). Stefan Fuhrmann (stefan2) has some great
ideas for optimizing the token-scanning phase, which could help a lot
with that. We had some nice discussions about it during the Berlin
Hackathon last month (but I haven't had the chance yet to dig into it
further). But that's all definitely 1.8-or-later stuff.

Also, in the same vein, I think this current patch will have to wait for
post-1.7.
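As promised above, here is how I currently read that reset check,
paraphrased into a stand-alone snippet so we can compare notes. The
fp_entry struct and the loop bounds are my own simplification (I'm
assuming fp is offset so that negative k is a valid index, as in lcs.c);
it is not the literal patch code:

[[[
/* Paraphrase of the reset check as I read it -- illustrative only. */
struct fp_entry
{
  int position;           /* furthest point reached on diagonal k */
  int uniquematchcount;   /* unique-line matches along that path  */
};

static int
has_reset_point(const struct fp_entry *fp, int d, int p)
{
  int limit = (d >= 0 ? 2 * p + d : 2 * p - d);
  int k;

  /* Scan the whole d-range, as the patch seems to do; my question
     above is whether this scan could stop at the first hit. */
  for (k = (d < 0 ? d : 0) - p; k <= (d > 0 ? d : 0) + p; k++)
    {
      int threshold = (k < 0 ? limit + k : limit - k);

      if (fp[k].uniquematchcount > threshold)
        return 1;   /* a diagonal where clear_fp() would kick in */
    }

  return 0;
}
]]]

If that reading is wrong, please correct me, since that would probably
also invalidate my "break" question above.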
I like the idea a lot, and would definitely like to see it in svn one
day, but it's not such a trivial change. With 1.7 just around the corner,
we're trying to avoid changes that are not directly related to getting
1.7 (finally) out the door :-).

> It is quite easy to do a quick heuristic test for whether this patch
> would help or not, so we could have alternate versions of the LCS
> algorithm depending on whether that test checks out. That would
> make the code somewhat harder to maintain, of course.

That might be a good idea, to really get the best of both worlds (see the
P.S. below for the rough shape I have in mind). Though I think it would
also be acceptable to optimize for the common case (reasonably matching
files, modulo all the "non-matching lines"), at a slight cost for the
rare case (files that match terribly badly, even after "skipping all the
non-matching lines"). Not sure right now, I guess it depends on the cost
of the added code complexity. Definitely something to look at post-1.7
:-).

>> Or are you looking at this for some specific purpose, some repository
>> where this situation occurs regularly and has a big impact (huge
>> LCS'es ...)?
>>
>
> Nope, I just like to optimize stuff, and since we were already
> looking at diff optimization I thought I might as well point out the
> possibility.

Cool :-). Me too.

-- 
Johan
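P.S.: For the record, the kind of "cheap heuristic test up front, then
pick an LCS variant" dispatch I'm picturing is roughly the sketch below.
Every name in it is invented and the threshold is a placeholder, so it
only illustrates the shape of the idea, not actual libsvn_diff code:

[[[
/* All names are invented; this only sketches the dispatch idea. */
struct token_stats
{
  long total_lines;    /* lines left after prefix/suffix elimination */
  long unique_lines;   /* lines occurring exactly once in each file  */
};

enum lcs_variant
{
  LCS_CLASSIC,         /* the current algorithm                      */
  LCS_UNIQUE_RESET     /* the variant with the unique-match reset    */
};

/* Cheap heuristic: only pick the reset-based variant when a reasonable
 * share of the remaining lines are unique matches, since that is when
 * resets can actually shorten the run.  The 1/16 threshold is a
 * placeholder, not a tuned value. */
static enum lcs_variant
choose_lcs_variant(const struct token_stats *stats)
{
  if (stats->total_lines > 0
      && stats->unique_lines * 16 >= stats->total_lines)
    return LCS_UNIQUE_RESET;

  return LCS_CLASSIC;
}
]]]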