Greetings all,

I've been doing some stress testing of NSP lately, focusing in particular on figuring out the cost of "windowing", which lets us count bigrams whose two words may be separated by some number of intervening words (rather than requiring them to be adjacent). This is provided by the --window option in count.pl.
Before I get into the results, here is some info about the hardware (so you can compare to your own setup): a Dell Precision 670 with 12 GB of RAM and 2 dual processor Xeons, each with a 3.00 GHz clock rate. Note that NSP only uses one core. uname -a output is:

Linux marimba 2.6.24-23-server #1 SMP Wed Apr 1 22:14:30 UTC 2009 x86_64 GNU/Linux

I used part of one directory of the English Gigaword data and created a single file of input where all the markup was removed and the text was converted to upper case. So, this is newspaper text, essentially.

I ran a series of commands of this form...

perl -d:DProf count.pl --ngram 2 --window X windowX.out nsp-stress-test.txt

...meaning that we count all bigrams using a window size of X. Recall that the window size indicates how many words a bigram may span: with a window size of w, up to w-2 words may occur between the two words that form a bigram. By default the window size is 2, meaning that the bigrams must be adjacent. If the window size is 3, then there may be up to 1 intervening word between the two words in a bigram, and so forth.

Below is the script that I used to run these - note that the main variable is the window size - the point of this exercise really was to get a sense of how much windowing costs.

==============================================================================
#!/bin/bash -l

for value in 2 3 5 10 25 50
do
    echo "window $value remove $value..."
    perl -d:DProf /usr/local/bin/count.pl --window $value --ngram 2 --remove $value window$value.out nsp-stress-test.txt
    dprofpp > window$value.prof.out
done
==============================================================================

In this case, nsp-stress-test.txt (11 MB) consists of the following (btw, this is a small file, but I wanted to get started somewhere...):

1,894,410 words of text (tokens)
   44,919 unique words (types)

Here's what we see in terms of time and space as the window size increases...
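To make the window semantics concrete, here is a small Python sketch (my own illustration, not NSP code - the function name and toy sentence are mine) of windowed bigram counting as just described:

```python
# Illustrative sketch of windowed bigram counting: with window size w,
# a bigram's two words may be at most w-1 positions apart, i.e. up to
# w-2 words may intervene between them.
from collections import Counter

def windowed_bigrams(tokens, window=2):
    """Count ordered word pairs whose positions differ by at most window-1."""
    counts = Counter()
    n = len(tokens)
    for i in range(n):
        # pair token i with each of the next window-1 tokens
        for j in range(i + 1, min(i + window, n)):
            counts[(tokens[i], tokens[j])] += 1
    return counts

tokens = "THE CAT SAT ON THE MAT".split()
print(sum(windowed_bigrams(tokens, 2).values()))  # 5 adjacent pairs
print(sum(windowed_bigrams(tokens, 3).values()))  # 9 pairs (5 adjacent + 4 skipping one word)
```

So at window size 2 the toy sentence yields only its 5 adjacent pairs, while window size 3 adds the 4 pairs with one intervening word.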
window   elapsed (s)   memory (MB)   total bigrams    unique bigrams
------   -----------   -----------   -------------    --------------
     2      72.63075           240       1,894,409           523,053
     3     121.7461            490       3,788,817         1,129,153
     5     217.5907            960       7,577,630         2,264,646
    10     450.1853           1900      17,049,645         4,582,673
    25    1092.680            3800      45,465,540         9,649,220
    50    2067.010            6000      92,824,865        15,673,225

A few summarizing stats...

window size 50 takes 28.5 times as long to run as window size 2
window size 50 takes 25 times as much memory as window size 2
window size 50 has 30 times as many unique bigrams as window size 2
window size 50 has 49 times as many total bigrams as window size 2

The good news is that the increase in time and memory use is linear in the window size. The bad news is that the memory footprint starts off fairly large, so even a linear increase can lead to fairly high utilization. A rough rule of thumb, then, is that as the window size increases, time and space requirements will grow in proportion. So going from a window size of 2 to 25 results in roughly a 15x increase in time and memory.

Now, this is all based on just a single file and a relatively small set of results, so my next step is to do this with a larger data file and see if the above rules of thumb continue to hold...

More soon,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
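P.S. A quick sanity check on the total bigram counts (my own back-of-the-envelope arithmetic, not NSP output): if every ordered pair of words within the window is counted, then N tokens with window size w should yield (w-1)*N - w*(w-1)/2 total bigrams, since each position pairs with up to w-1 following positions, less the shortfall near the end of the file. The totals reported above match this formula exactly, which also explains the linear growth in time:

```python
# Sanity check: closed-form total bigram count vs. the figures reported
# in this post.  N and the totals are taken directly from the results above.
N = 1_894_410  # tokens in nsp-stress-test.txt

reported = {2: 1_894_409, 3: 3_788_817, 5: 7_577_630,
            10: 17_049_645, 25: 45_465_540, 50: 92_824_865}

for w, total in reported.items():
    predicted = (w - 1) * N - w * (w - 1) // 2
    assert predicted == total, (w, predicted, total)
print("all window sizes match")
```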