Greetings again, I've done some additional stress testing, this time with 25 MB of text, just to see if my previously derived rule of thumb about the cost of windowing in NSP holds true. I think the answer is that it generally does. My rule of thumb was that time and memory will increase linearly with the size of the window.
So if your window is size N, it will take about N/2 times as much time and memory as when you are not using windowing (--window size of 2 with bigrams). So, if you have a window size of 10, you should expect it to use about 5 times as much time and space. In practice it's not quite that precise, so if you expect an increase of N/2 to N times as much time and space you should be fine.

For the 25 MB data, there are 25,855,406 words according to wc. According to NSP with --ngram 1, there are 25,863,165 tokens and 191,580 types. I ran count.pl with window sizes 2, 3, 5, and 10, and kept track of the time and memory used...

window size = 2
Total Elapsed Time = 932.79 Seconds
Memory Utilization = 1,300 MB
Total Bigrams = 25,863,164
Unique Bigrams = 3,356,576

window size = 3
Total Elapsed Time = 1452.03 Seconds
Memory Utilization = 2,800 MB
Total Bigrams = 51,726,327
Unique Bigrams = 7,419,894

window size = 5
Total Elapsed Time = 2610.12 Seconds
Memory Utilization = 5,500 MB
Total Bigrams = 103,452,650
Unique Bigrams = 14,746,945

window size = 10
Total Elapsed Time = 5334.88 Seconds
Memory Utilization = 11,000 MB
Total Bigrams = 232,768,439
Unique Bigrams = 28,843,560

You can see here a fairly consistent linear growth pattern as the window size increases, which would allow us to predict, for example, that a window size of 25 would take about 11,650 seconds and use 16,250 MB of memory (take the time and space of window size 2 and multiply by 12.5). For a window size of 50 we'd end up with 23,330 seconds and 32,500 MB of memory. I might actually go ahead and multiply by 25 and 50 as well, just to get an idea of an upper bound.

So, while the increase is linear, it becomes pretty expensive just because we start off using a good bit of time and space. But then if you look at the total number of bigrams as the window size increases, we are in fact doing a lot more counting, so these increases shouldn't be altogether surprising.

Bottom line?
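(Side note: the projection arithmetic above is simple enough to script for your own hardware. Here's a minimal Python sketch -- not part of NSP -- that applies the N/2 rule of thumb to the window-size-2 baseline figures from the table above:)

```python
# Back-of-the-envelope projection of count.pl's windowing cost.
# Rule of thumb: a window of size N costs roughly N/2 times the time
# and memory of the default (window size 2, adjacent bigrams).

BASELINE_TIME_S = 932.79   # measured above: window size 2, 25 MB input
BASELINE_MEM_MB = 1300     # measured above: window size 2, 25 MB input

def project(window_size):
    """Return (estimated seconds, estimated MB) for a given window size."""
    factor = window_size / 2
    return BASELINE_TIME_S * factor, BASELINE_MEM_MB * factor

for w in (10, 25, 50):
    t, m = project(w)
    print(f"window {w:2d}: ~{t:,.0f} seconds, ~{m:,.0f} MB")
```

Since actual growth lands somewhere between N/2 and N times the baseline, treat the printed figures as a lower bound on what to provision.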
Before running experiments with larger window sizes, run NSP without windowing (--ngram 2) and then project out time and space to make sure that is realistic for whatever hardware you are using!

Enjoy,
Ted

On Thu, Oct 22, 2009 at 10:07 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
> Greetings all,
>
> I've been doing some stress testing of NSP lately, focusing in
> particular on figuring out the cost of "windowing", which allows us to
> count bigrams that allow some number of intervening words to appear
> between them (rather than just requiring them to be adjacent). This is
> provided by the --window option in count.pl
>
> Before I get into the results, here is some info about the hardware
> (so you can compare to your own setup):
>
> Dell Precision 670 with 12 GB of RAM and 2 dual processor Xeons, each
> with a 3.00 GHz clock rate. Note that NSP only uses one core.
> uname -a output is: Linux marimba 2.6.24-23-server #1 SMP Wed Apr 1
> 22:14:30 UTC 2009 x86_64 GNU/Linux
>
> I used part of one directory of the English Gigaword data and created
> a single file of input where all the markup was removed and the text
> was converted to upper case. So, this is newspaper text essentially.
>
> I ran a series of commands of this form...
>
> perl -d:DProf count.pl --ngram 2 --window X windowX.out nsp-stress-test.txt
>
> ...meaning that we count all bigrams using a window size of X. Recall
> that the window size indicates how many words can occur between two
> words that are forming a bigram. By default the window size is 2,
> meaning that the bigrams must be adjacent. If the window size is 3,
> then there may be up to 1 intervening word between two words in a
> bigram, and so forth.
>
> Below is the script that I used to run these - note that the main
> variable is the window size - the point of this exercise really was
> to get a sense of how much windowing costs.
>
> =======================================
>
> #!/bin/bash -l
>
> for value in 2 3 5 10 25 50
> do
>    echo "window $value remove $value..."
>    perl -d:DProf /usr/local/bin/count.pl --window $value --ngram 2 --remove $value window$value.out nsp-stress-test.txt
>    dprofpp > window$value.prof.out
> done
>
> ==============================================================================
>
> In this case, nsp-stress-test.txt (11 MB) consists of the following
> (btw, this is a small file but I wanted to get started somewhere...):
>
> 1,894,410 words of text (tokens)
> 44,919 unique words (types)
>
> Here's what we see in terms of time and space as window size increases...
>
> --------------------
>
> window size = 2
> Total Elapsed Time = 72.63075 Seconds
> Memory Utilization = 240 MB
> Total Bigrams = 1,894,409
> Unique Bigrams = 523,053
>
> window size = 3
> Total Elapsed Time = 121.7461 Seconds
> Memory Utilization = 490 MB
> Total Bigrams = 3,788,817
> Unique Bigrams = 1,129,153
>
> window size = 5
> Total Elapsed Time = 217.5907 Seconds
> Memory Utilization = 960 MB
> Total Bigrams = 7,577,630
> Unique Bigrams = 2,264,646
>
> window size = 10
> Total Elapsed Time = 450.1853 Seconds
> Memory Utilization = 1900 MB
> Total Bigrams = 17,049,645
> Unique Bigrams = 4,582,673
>
> window size = 25
> Total Elapsed Time = 1092.680 Seconds
> Memory Utilization = 3800 MB
> Total Bigrams = 45,465,540
> Unique Bigrams = 9,649,220
>
> window size = 50
> Total Elapsed Time = 2067.010 Seconds
> Memory Utilization = 6000 MB
> Total Bigrams = 92,824,865
> Unique Bigrams = 15,673,225
>
> A few summarizing stats...
>
> window size 50 takes 28.7 times as long to run as window size 2
> window size 50 takes 25 times as much memory as window size 2
>
> window size 50 has 30 times as many unique bigrams as window size 2
> window size 50 has 48.9 times as many total bigrams as window size 2
>
> The good news is that the increase in time and memory use is linear with
> the window size.
> The bad news is that the memory footprint starts off
> fairly large, so even a linear increase can lead to fairly high
> utilization.
>
> A rough rule of thumb then is that as window size increases, time and
> space requirements will expand by the amount of the window size
> increase. So going from a window size of 2 to 25 will result in a 25x
> increase in time and memory.
>
> Now, this is all based on just a single file and a relatively small set of
> results, so my next step is to do this with a larger data file and see
> if the above rules of thumb continue to hold...
>
> More soon,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
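[A footnote on the mechanics, for readers of the archived thread: the windowing scheme described in the quoted message -- each word is paired, in order, with every word up to window − 1 positions to its right -- can be sketched in a few lines of Python. This is a simplified illustration of the counting idea only, not count.pl's actual implementation, which also handles token definitions, stop lists, the --remove option, and so on:]

```python
from collections import Counter

def count_windowed_bigrams(tokens, window=2):
    """Pair each token with every token up to window-1 positions to its
    right, preserving order -- the counting idea behind count.pl --window
    (simplified; the real count.pl does much more)."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            counts[(left, tokens[j])] += 1
    return counts

tokens = "THE CAT SAT ON THE MAT".split()

# Window size 2: only adjacent pairs, so len(tokens) - 1 = 5 bigrams.
print(sum(count_windowed_bigrams(tokens, window=2).values()))  # 5

# Window size 3: up to 1 intervening word, so nearly twice as many pairs.
print(sum(count_windowed_bigrams(tokens, window=3).values()))  # 9

# In general, a window of size w over T tokens yields
# (w - 1) * T - w * (w - 1) // 2 total bigrams, which matches the
# counts reported for the 11 MB file, e.g. T = 1,894,410 and w = 5:
T, w = 1894410, 5
print((w - 1) * T - w * (w - 1) // 2)  # 7577630
```

This also explains why total bigrams (and hence work) grow roughly in proportion to w − 1: every token contributes up to w − 1 pairs instead of just one.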