Daniel Shahaf wrote on Thu, Dec 23, 2010 at 13:25:40 +0200:
> Johan Corveleyn wrote on Thu, Dec 23, 2010 at 01:51:08 +0100:
> > On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
> > <philip.mar...@wandisco.com> wrote:
> > > Johan Corveleyn <jcor...@gmail.com> writes:
> > >
> > >> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
> > >> <philip.mar...@wandisco.com> wrote:
> > >>> Johan Corveleyn <jcor...@gmail.com> writes:
> > >>>
> > >>>> This makes the diff algorithm another 10% - 15% faster (granted,
> > >>>> this was measured with my "extreme" testcase of a 1,5 Mb file
> > >>>> (60000 lines), of which most lines are identical prefix/suffix).
> > >>>
> > >>> Can you provide a test script? Or describe the test more fully,
> > >>> please.
> > >>
> > >> Hmm, it's not easy to come up with a test script to test this "from
> > >> scratch" (unless by testing diff directly, see below). I test it
> > >> with a repository (a dump/load of an old version of our production
> > >> repository) which contains this 60000-line xml file (1,5 Mb) with
> > >> 2272 revisions.
> > >>
> > >> I run blame on this file, over the svnserve protocol on localhost
> > >> (server running on the same machine), with an svnserve built from
> > >> Stefan^2's performance branch (with membuffer caching of full-texts,
> > >> so server I/O is not the bottleneck). This gives me an easy way to
> > >> call diff 2272 times on this file, and measure it (with the help of
> > >> some instrumentation code in blame.c, see attachment). And it's
> > >> incidentally the actual use case I first started out wanting to
> > >> optimize (blame for large files with many revisions).
> > >
> > > Testing with real-world data is important, perhaps even more important
> > > than artificial test data, but some test data would be useful. If you
> > > were to write a script to generate two test files of size 100MB, say,
> > > then you could use the tools/diff/diff utility to run Subversion diff
> > > on those two files. Or tools/diff/diff3 if it's a 3-way diff that
> > > matters. The first run might involve disk IO, but on most machines
> > > the OS should be able to cache the files, so subsequent hot-cache
> > > runs should be a good way to profile the diff code, assuming it is
> > > CPU limited.
> >
> > Yes, that's a good idea. I'll try to spend some time on that. But I'm
> > wondering about a good way to write such a script.
> >
> > I'd like the script to generate large files quickly, with content
> > that's not totally random, but also not 1000000 times the exact same
> > line (neither of those would be representative of real-world data,
> > and they might hit some edge behavior of the diff algorithm).
>
> How about using
>
> cat subversion/libsvn_wc/*.c
>
> as your test file?
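(Or, if you do want a synthetic generator rather than real source files,
something along these lines in Python should be quick even for files in
the 100MB range. This is an untested sketch; the line templates and the
command-line interface are made up for illustration:)

    #!/usr/bin/env python
    # Sketch of a test-data generator: cycle through a few distinct
    # line templates, so the output is neither totally random nor a
    # million copies of one line, and hand everything to writelines()
    # instead of doing one write() call per line.
    import sys

    TEMPLATES = [
        "line %d: some fixed text padding this line to ~70 bytes.....\n",
        "line %d: }  /* a brace-ish line, like real source files */\n",
        "line %d: int i;  /* another mostly-identical line */\n",
    ]

    def generate(path, num_lines):
        with open(path, "w") as f:
            f.writelines(TEMPLATES[i % len(TEMPLATES)] % i
                         for i in range(num_lines))

    if __name__ == "__main__":
        generate(sys.argv[1], int(sys.argv[2]))

(To get two slightly different files to diff, one could run it twice
with different line counts, or post-edit a few lines in one copy.)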
As to time:

    t1/subversion% time cat */*c | wc -c
    cat: tests/libsvn_wc: Is a directory
    9484278
    cat */*c  0.00s user 0.05s system 4% cpu 1.248 total
    wc -c  0.00s user 0.01s system 0% cpu 1.243 total

(but I ran 'make' earlier, so it might not be a cold cache)

> > (maybe totally random is fine, but is there an easy/fast way to
> > generate this?)
> >
> > As a first attempt, I quickly hacked up a small shell script, writing
> > out lines in a for loop, one by one, each with a fixed string plus
> > the line number (the index of the iteration). But that's too slow:
> > 10000 lines of 70 bytes, i.e. 700Kb, already takes 14 seconds.
> >
> > Maybe I can start with 10 or 20 different lines (or generate 100 in a
> > for loop), and then keep doubling that until I have enough (cat
> > file.txt >> file.txt). That will probably be faster. And it might be
> > "real-worldish" enough (a single source file also contains many
> > identical lines, e.g. all the lines with a single brace, etc.).
> >
> > Other ideas? Maybe there is already something like this lying around?
> >
> > Another question: is a shell script a bad idea, because it's neither
> > portable nor fast? Should I use Python for this? Maybe the "write
> > line by line with a line number in a for loop" approach would be a
> > lot faster in Python? I don't know a lot of Python, but this might be
> > a good opportunity to learn some...
>
> IMO, use whatever language is most convenient for you to write the
> script in. (Generating the test data need not be fast, since it's
> a once-only task.)

That is: *in my opinion* it doesn't need to be fast. But re-reading your
mail, I gather you think otherwise. Why? I assumed you'd run the script
once, generate a repository, then (commit that repository to ^/tags
somewhere for safekeeping and) work with that repository thereafter
without regenerating it each time; so generating wouldn't need to be
fast.
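FWIW, even if generation speed does turn out to matter, the "cat
file.txt >> file.txt" doubling trick is only a couple of lines in
Python too. A rough, untested sketch (the seed text and function name
are made up):

    # Seed ~100 distinct numbered lines, then double the buffer in
    # memory until it reaches the requested size. Note the result ends
    # up somewhere between target_bytes and 2 * target_bytes, but it is
    # always made of whole lines.
    def generate_by_doubling(path, target_bytes):
        data = "".join("seed line %d with some filler text\n" % i
                       for i in range(100))
        while len(data) < target_bytes:
            data += data
        with open(path, "w") as f:
            f.write(data)

    # e.g.: generate_by_doubling("test-100mb.txt", 100 * 1024 * 1024)

Reaching 100MB from a ~3.5Kb seed takes only about 15 doublings, so the
cost is dominated by the final write, not by the loop.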