Greetings all,

I've been doing some stress testing of NSP lately, focusing in
particular on figuring out the cost of "windowing", which lets us
count bigrams whose two words may be separated by some number of
intervening words (rather than requiring them to be adjacent). This
is provided by the --window option in count.pl.

Before I get into the results, here is some info about the hardware
(so you can compare to your own setup):

Dell Precision 670 with 12 GB of RAM and two dual-core Xeons, each
with a 3.00 GHz clock rate. Note that NSP only uses one core.
uname -a output is : Linux marimba 2.6.24-23-server #1 SMP Wed Apr 1
22:14:30 UTC 2009 x86_64 GNU/Linux

I used part of one directory of the English Gigaword data and created
a single file of input where all the markup was removed and the text
was converted to upper case. So, this is essentially newspaper text.
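
If you want to prepare similar input yourself, something roughly along
these lines will do the trick (just a sketch, not the exact command I
used, and the input file name here is only a placeholder):

perl -pe 's/<[^>]*>//g; $_ = uc $_;' gigaword-slice.sgml > nsp-stress-test.txt

That simply deletes anything that looks like SGML markup and upcases
whatever is left.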

I ran a series of commands of this form...

perl -d:DProf count.pl --ngram 2 --window X windowX.out nsp-stress-test.txt

...meaning that we count all bigrams using a window size of X. Recall
that the window size controls how far apart the two words of a bigram
may be: with a window of size X, a bigram can be formed from any two
words that occur within X positions of each other, that is, with up to
X - 2 intervening words between them. By default the window size is 2,
meaning that the two words of a bigram must be adjacent. If the window
size is 3, then there may be up to one intervening word, and so forth.
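
To make that concrete, here is a toy Perl sketch (this is not the code
inside count.pl, just an illustration of the definition) that pairs
each token with the next window - 1 tokens to its right:

=======================================

#!/usr/bin/perl
# toy illustration of windowing, *not* the NSP implementation -
# pair each token with the next (window - 1) tokens to its right

use strict;
use warnings;

my $window = 3;   # set this to 2 to get only the adjacent pairs
my @tokens = qw(THE CAT SAT ON THE MAT);

for my $i (0 .. $#tokens) {
    for my $j ($i + 1 .. $i + $window - 1) {
        last if $j > $#tokens;
        print "$tokens[$i] $tokens[$j]\n";
    }
}

=======================================

With a window of 3 this lists THE SAT and CAT ON in addition to the
adjacent pairs, whereas with the default window of 2 only the adjacent
pairs come out.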

Below is the script that I used to run these - note that the main
variable is  the window size - the point of this exercise really was
to get a sense of how much windowing costs.

=======================================

#!/bin/bash -l

for value in 2 3 5 10 25 50

do
        echo "window $value remove $value..."
        perl -d:DProf /usr/local/bin/count.pl --window $value \
            --ngram 2 --remove $value window$value.out nsp-stress-test.txt
        dprofpp > window$value.prof.out
done

==============================================================================

In this case, nsp-stress-test.txt (11 MB) consists of the following
(btw, this is a small file, but I wanted to get started somewhere...):

1,894,410 words of text (tokens)
44,919 unique words (types)
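
For anyone who wants comparable figures for their own data, a quick
and dirty one-liner along these lines gives ballpark token/type counts
for whitespace separated text (NSP's own tokenization may split things
differently, so the numbers won't necessarily match count.pl exactly):

perl -lane '$tok += @F; $typ{$_}++ for @F;
  END { print "$tok tokens, ", scalar(keys %typ), " types" }' nsp-stress-test.txt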

Here's what we see in terms of time and space as window size increases...

--------------------

window size = 2
Total Elapsed Time = 72.63075 Seconds
Memory Utilization = 240 MB
Total Bigrams = 1,894,409
Unique Bigrams = 523,053

window size = 3
Total Elapsed Time = 121.7461 Seconds
Memory Utilization = 490 MB
Total Bigrams = 3,788,817
Unique Bigrams = 1,129,153

window size = 5
Total Elapsed Time = 217.5907 Seconds
Memory Utilization = 960 MB
Total Bigrams = 7,577,630
Unique Bigrams = 2,264,646

window size = 10
Total Elapsed Time = 450.1853 Seconds
Memory Utilization = 1900 MB
Total Bigrams = 17,049,645
Unique Bigrams = 4,582,673

window size = 25
Total Elapsed Time = 1092.680 Seconds
Memory Utilization = 3800 MB
Total Bigrams = 45,465,540
Unique Bigrams = 9,649,220

window size = 50
Total Elapsed Time = 2067.010 Seconds
Memory Utilization = 6000 MB
Total Bigrams = 92,824,865
Unique Bigrams = 15,673,225

A few summarizing stats...

window size 50 takes about 28.5 times as long to run as window size 2
window size 50 takes 25 times as much memory as window size 2

window size 50 has about 30 times as many unique bigrams as window size 2
window size 50 has about 49 times as many total bigrams as window size 2

The good news is that the increase in time and memory use is roughly
linear in the window size. The bad news is that the memory footprint
starts off fairly large, so even a linear increase can lead to fairly
high utilization.
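
Incidentally, the linear growth is just what you would expect for the
total bigram counts. With a window of size W over a single stream of N
tokens, each token gets paired with the W - 1 tokens to its right,
minus the pairs that fall off the end of the file, for a total of
(W - 1) * N - W * (W - 1) / 2. A quick check in Perl reproduces the
totals reported above exactly:

=======================================

#!/usr/bin/perl
# expected total bigrams for N tokens and window size W:
# (W - 1) * N - W * (W - 1) / 2
# (the second term accounts for the last W - 1 tokens in the file,
# which have fewer than W - 1 tokens to their right)

use strict;
use warnings;

my $n = 1_894_410;    # tokens in nsp-stress-test.txt
foreach my $w (2, 3, 5, 10, 25, 50) {
    my $total = ($w - 1) * $n - $w * ($w - 1) / 2;
    printf "window %2d : %d total bigrams\n", $w, $total;
}

=======================================

The time and memory figures grow a bit more slowly than the raw bigram
totals, but still more or less linearly in the window size.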

A rough rule of thumb, then, is that time and space requirements grow
in proportion to the window size: multiply the window size by some
factor, and time and memory go up by roughly that same factor. So going
from a window size of 2 to 50 (25 times larger) results in roughly a
25x increase in memory and close to a 30x increase in time.

Now, this is all based on just a single file and a relatively small set of
results, so my next step is to do this with a larger data file and see
if the above rules of thumb continue to hold...

More soon,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
