Johan Corveleyn wrote on Thu, Dec 23, 2010 at 01:51:08 +0100:
> On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
> <philip.mar...@wandisco.com> wrote:
> > Johan Corveleyn <jcor...@gmail.com> writes:
> >
> >> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
> >> <philip.mar...@wandisco.com> wrote:
> >>> Johan Corveleyn <jcor...@gmail.com> writes:
> >>>
> >>>> This makes the diff algorithm another 10% - 15%
> >>>> faster (granted, this was measured with my "extreme" testcase of a
> >>>> 1.5 MB file (60000 lines), of which most lines are identical
> >>>> prefix/suffix).
> >>>
> >>> Can you provide a test script? Or describe the test more fully,
> >>> please.
> >>
> >> Hmm, it's not easy to come up with a test script to test this "from
> >> scratch" (unless by testing diff directly, see below). I test it
> >> with a repository (a dump/load of an old version of our production
> >> repository) which contains this 60000-line XML file (1.5 MB) with
> >> 2272 revisions.
> >>
> >> I run blame on this file, over the svnserve protocol on localhost
> >> (server running on the same machine), with an svnserve built from
> >> Stefan^2's performance branch (with membuffer caching of full-texts,
> >> so server I/O is not the bottleneck). This gives me an easy way to
> >> call diff 2272 times on this file, and measure it (with the help of
> >> some instrumentation code in blame.c, see attachment). And it's
> >> incidentally the actual use case I first started out wanting to
> >> optimize (blame for large files with many revisions).
> >
> > Testing with real-world data is important, perhaps even more important
> > than artificial test data, but some test data would be useful. If you
> > were to write a script to generate two test files of size 100MB, say,
> > then you could use the tools/diff/diff utility to run Subversion diff
> > on those two files. Or tools/diff/diff3 if it's a 3-way diff that
> > matters. The first run might involve disk IO, but on most machines the
> > OS should be able to cache the files, and subsequent hot-cache runs
> > should be a good way to profile the diff code, assuming it is CPU
> > limited.
>
> Yes, that's a good idea. I'll try to spend some time on that. But I'm
> wondering about a good way to write such a script.
>
> I'd like the script to generate large files quickly, with content
> that's not totally random, but also not 1000000 times the exact same
> line (neither of those would be representative of real-world data, and
> might hit some edge behavior of the diff algorithm).
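For the two-file test Philip suggests, the generator needn't be fancy.
Here's a minimal sketch of one possible approach (the file names, sizes,
and line format are my own assumptions, not part of any existing tool):
it writes two files that are identical except for a narrow band of lines
in the middle, which is exactly the shape of data the prefix/suffix
scanning should benefit from.

[[[
#!/usr/bin/env python
# Sketch only: write two large files that are identical except for a
# narrow band of lines in the middle, so that almost every line is
# common prefix or suffix.  All names and sizes are assumptions.

LINES = 1500000   # on the order of 100MB, at a few dozen bytes per line

def make(path, tag):
    out = []
    for i in range(LINES):
        if 700000 <= i < 700100:
            # The only region where the two files differ.
            out.append('this is line %d of the generated data%s\n'
                       % (i, tag))
        else:
            out.append('this is line %d of the generated data\n' % i)
    # Join and write in one go: much faster than a million small writes.
    open(path, 'w').write(''.join(out))

make('left.txt', '')
make('right.txt', ' -- changed')
]]]

Hot-cache runs of tools/diff/diff over the two files should then spend
their time mostly in the diff code itself.  But you may not need to
generate anything at all: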
How about using cat subversion/libsvn_wc/*.c as your test file?  It's
currently 1.8MB, and should give plenty of prefix/suffix...  (e.g.,
adm_crawler.c often is untouched)

> (maybe totally random is fine, but is there an easy/fast way to
> generate this?)
>
> As a first attempt, I quickly hacked up a small shell script, writing
> out lines in a for loop, one by one, with a fixed string together with
> the line number (the index of the iteration). But that's too slow
> (10000 lines of 70 bytes, i.e. 700KB, already takes 14 seconds).
>
> Maybe I can start with 10 or 20 different lines (or generate 100 in a
> for loop), and then keep doubling that until I have enough (cat
> file.txt >> file.txt). That will probably be faster. And it might be
> "real-worldish" enough (a single source file also contains many
> identical lines, e.g. all lines with a single brace etc.).
>
> Other ideas? Maybe there is already something like this lying around?
>
> Another question: a shell script might not be good, because it's not
> portable (and not fast)? Should I use Python for this? Maybe the
> "write line by line with a line number in a for loop" approach would
> be a lot faster in Python? I don't know a lot of Python, but it might
> be a good opportunity to learn some ...

IMO, use whatever language is most convenient for you to write the
script in.  (Generating the test data need not be fast, since it's a
once-only task.)

> Are there any examples of such "manual test scripts" in svn? So I
> could have a look at the style, coding habits, ... maybe borrow some
> boilerplate code?

If there are, they are in the source tree :-)  There are various Python
scripts under tools/ (particularly tools/dev/) and subversion/tests/,
but I'm not sure how close they are to what you're looking for.

> Cheers,
> --
> Johan
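P.S. The doubling trick would also be easy (and fast) in Python, since
the copying then happens in memory.  A sketch, with the seed content,
file name, and target size as arbitrary assumptions:

[[[
#!/usr/bin/env python
# Sketch of the doubling idea: start from a small set of distinct
# lines, then keep doubling the buffer -- the in-memory equivalent of
# "cat file.txt >> file.txt".

TARGET = 100 * 1024 * 1024   # aim for roughly 100MB

# A handful of distinct lines, plus some repeated one-character lines
# to mimic the braces of real source files.
seed = ['int variable_%d = %d;\n' % (i, i) for i in range(20)]
seed += ['{\n', '}\n'] * 5

data = ''.join(seed)
while len(data) < TARGET:
    data += data   # each pass doubles the size; ~18 passes to 100MB

open('big.txt', 'w').write(data)
]]]

One caveat: the doubled content repeats with a period equal to the seed
length, so it may be worth a sanity check that this doesn't itself hit
an edge case of the diff algorithm.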