On Thu, Dec 23, 2010 at 3:44 AM, Gavin Beau Baumanis
<gav...@thespidernet.com> wrote:
> Hi Johan,
Hi Gavin. Thanks for your interest. It's a nice problem, isn't it?

> I was intrigued by your requirement to create a large file for testing.
> I remember, from a really long time ago when I learnt C, that we used a
> specific algorithm for creating "natural" and "random" text.
> With some help from Mr. Google I found out about Markov Chains, which
> look promising - I can't remember if that was what I learned about or
> not - but it looks like it might prove helpful nonetheless.
>
> A little further Googling and I found this specific post on
> stackoverflow:
> http://stackoverflow.com/questions/1037719/how-can-i-quickly-create-large-1gb-textbinary-files-with-natural-content

Interesting link. If I'm reading it correctly, the best suggestion in
there was the one with those Markov Chains. But that seems maybe a
little overkill / too much work for me.

> No idea if it is going to help you specifically or not... but there are
> quite a few ideas in the comments;
> * Obtain a copy of the first 100MB from wikipedia - for example.

Nah, I wouldn't like to depend on some internet resource.

> Finally, if you happen to have a large enough file already, could you
> not use the "split" (unix) function to give you a specific sized file?

Yes, but we'll first need that large enough file :-).

> Actually - I was so intrigued by the "challenge" of this - I have had a
> think about it over lunch;
>
> Could you not do this?
> (pseudo-code)
>
> Start with a known "chunk" - say the license file.
> Get the length of the license file - for example only = 125 bytes.
> Append the chunk to itself until such time as you have the required
> size. Write the file once.
>
> startSize = length(knownFile);
> int numberOfLoops = log2(desiredFileSize / startSize);
> newFile = knownFile
>
> Loop for numberOfLoops
> {
>     newFile = newFile + newFile
> }
>
> write newFile;

Yes, that certainly seems a sensible approach. I actually had the same
idea (see below: "start doubling that until I have enough (cat
file.txt >> file.txt)"). Well, it's almost the same: in your suggestion
the file is built up completely in memory and only written out at the
end. That's also an option, but I think it's better to write it out
every time it's doubled (to avoid using up too much memory). But a good
suggestion anyway.

I think I will go with some variation of this (starting with a large
enough chunk of representative content).
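For what it's worth, a rough Python sketch of that variation (doubling
on disk instead of in memory) could look like the following. This is
untested, and the seed/target file names and the 100 MB target size are
just placeholders:

import os
import shutil

SEED = 'seed.txt'                  # placeholder: small representative file
TARGET = 'big.txt'                 # placeholder: the large file to build
TARGET_SIZE = 100 * 1024 * 1024    # placeholder: stop at ~100 MB

# Start from a copy of the seed chunk.
shutil.copyfile(SEED, TARGET)

size = os.path.getsize(TARGET)
while size < TARGET_SIZE:
    # Append the file's first 'size' bytes to itself, in small chunks.
    # Reading only the original 'size' bytes avoids chasing the growing
    # end of the file (the trap in "cat file.txt >> file.txt").
    src = open(TARGET, 'rb')
    dst = open(TARGET, 'ab')
    remaining = size
    while remaining > 0:
        chunk = src.read(min(65536, remaining))
        if not chunk:
            break
        dst.write(chunk)
        remaining -= len(chunk)
    dst.close()
    src.close()
    size = os.path.getsize(TARGET)

Each pass doubles the file, so it needs log2(desiredFileSize /
startSize) iterations just like your pseudo-code, but it never holds
more than one 64 KB chunk in memory.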
Cheers,
Johan

> On 23/12/2010, at 11:51 AM, Johan Corveleyn wrote:
>
>> On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
>> <philip.mar...@wandisco.com> wrote:
>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>
>>>> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
>>>> <philip.mar...@wandisco.com> wrote:
>>>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>>>
>>>>>> This makes the diff algorithm another 10% - 15% faster (granted,
>>>>>> this was measured with my "extreme" test case of a 1.5 MB file
>>>>>> (60000 lines), of which most lines are identical prefix/suffix).
>>>>>
>>>>> Can you provide a test script? Or describe the test more fully,
>>>>> please.
>>>>
>>>> Hmm, it's not easy to come up with a test script to test this "from
>>>> scratch" (unless by testing diff directly, see below). I test it
>>>> with a repository (a dump/load of an old version of our production
>>>> repository) which contains this 60000-line XML file (1.5 MB) with
>>>> 2272 revisions.
>>>>
>>>> I run blame on this file, over the svnserve protocol on localhost
>>>> (server running on the same machine), with an svnserve built from
>>>> Stefan^2's performance branch (with membuffer caching of full
>>>> texts, so server I/O is not the bottleneck). This gives me an easy
>>>> way to call diff 2272 times on this file and measure it (with the
>>>> help of some instrumentation code in blame.c, see attachment). And
>>>> it's incidentally the actual use case I first started out wanting
>>>> to optimize (blame for large files with many revisions).
>>>
>>> Testing with real-world data is important, perhaps even more
>>> important than artificial test data, but some test data would be
>>> useful. If you were to write a script to generate two test files of
>>> size 100MB, say, then you could use the tools/diff/diff utility to
>>> run Subversion diff on those two files. Or tools/diff/diff3 if it's
>>> a 3-way diff that matters. The first run might involve disk IO, but
>>> on most machines the OS should be able to cache the files, and
>>> subsequent hot-cache runs should be a good way to profile the diff
>>> code, assuming it is CPU limited.
>>
>> Yes, that's a good idea. I'll try to spend some time on that. But I'm
>> wondering about a good way to write such a script.
>>
>> I'd like the script to generate large files quickly, and with content
>> that's not totally random, but also not 1000000 times the exact same
>> line (neither of those would be representative of real-world data,
>> and either might hit some edge behavior of the diff algorithm).
>> (Maybe totally random is fine, but is there an easy/fast way to
>> generate that?)
>>
>> As a first attempt, I quickly hacked up a small shell script, writing
>> out lines in a for loop, one by one, with a fixed string together
>> with the line number (the index of the iteration). But that's too
>> slow (10000 lines of 70 bytes, i.e. 700 KB, already takes 14
>> seconds).
>>
>> Maybe I can start with 10 or 20 different lines (or generate 100 in a
>> for loop), and then start doubling that until I have enough (cat
>> file.txt >> file.txt). That will probably be faster. And it might be
>> "real-worldish" enough (a single source file also contains many
>> identical lines, e.g. all the lines with a single brace, etc.).
>>
>> Other ideas? Maybe there is already something like this lying around?
>>
>> Another question: a shell script might not be good, because it's not
>> portable (and not fast)? Should I use Python for this? Maybe the
>> "write line by line with a line number in a for loop" approach would
>> be a lot faster in Python? I don't know a lot of Python, but it might
>> be a good opportunity to learn some ...
>>
>> Are there any examples of such "manual test scripts" in svn? So I
>> could have a look at the style, coding habits, ... and maybe borrow
>> some boilerplate code?
>>
>> Cheers,
>> --
>> Johan
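Regarding the question quoted above, whether the "write line by line
with a line number in a for loop" approach would be faster in Python:
probably yes, since Python keeps the file open and buffers the writes,
whereas a naive shell loop typically pays per-line overhead (reopening
the output file or spawning a process for every line). An untested
sketch; the file name, line format and line count are arbitrary:

NUM_LINES = 100000    # ~7 MB at roughly 70 bytes per line

with open('test.txt', 'w') as f:
    for i in range(NUM_LINES):
        # A fixed string plus the line number, as in the shell attempt.
        f.write('this is a reasonably long fixed string on line %d\n' % i)

The output of this could then serve as the seed for the doubling sketch
earlier in this mail, to get to 100 MB quickly.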