Greetings again,

I've done some additional stress testing, this time with 25 MB of
text, just to see if my previously derived rule of thumb about the cost
of windowing in NSP will hold true. I think the answer is that it
generally does. My rule of thumb was that time and memory will
increase linearly with the increase in the size of the window.

So if your window size is N, then it will take about N/2 times as much
time and memory as when you are not using windowing (a --window size of
2 with bigrams). So, if you have a window size of 10, you should expect
it to use about 5 times as much time and space. In practice it's not
quite that precise, so if you plan on an increase of between N/2 and N
times as much time and space you should be fine.
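
For example, using the window 2 and window 10 numbers below: 5 x 932.79
= 4,664 seconds and 5 x 1,300 = 6,500 MB predicted, versus 5,334.88
seconds and 11,000 MB observed. That works out to about 5.7 times the
time and 8.5 times the memory, which is indeed between N/2 (5) and
N (10).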

For the 25MB data, there are 25,855,406 words according to wc.
According to NSP with --ngram 1, there are 25,863,165 tokens and
191,580 types...

I ran count.pl with window sizes 2, 3, 5, and 10, and kept track of the time
and memory used...

window size = 2
Total Elapsed Time = 932.79 Seconds
Memory Utilization = 1,300 MB
Total Bigrams = 25,863,164
Unique Bigrams =  3,356,576

window size = 3
Total Elapsed Time = 1452.03 Seconds
Memory Utilization = 2,800 MB
Total Bigrams = 51,726,327
Unique Bigrams = 7,419,894

window size = 5
Total Elapsed Time = 2610.12 Seconds
Memory Utilization = 5,500 MB
Total Bigrams = 103,452,650
Unique Bigrams = 14,746,945

window size = 10
Total Elapsed Time = 5334.88 Seconds
Memory Utilization = 11,000 MB
Total Bigrams = 232,768,439
Unique Bigrams = 28,843,560

You can see here a fairly consistent linear growth pattern as the
window size increases, which would allow us to predict, for example,
that a window size of 25 would take about 11,650 seconds and use
16,250 MB of memory. (Take the time and space for window size 2 and
multiply by 12.5.) For a window size of 50 we'd end up with 23,330
seconds and 32,500 MB of memory. I might actually go ahead and
multiply by 25 and 50 as well, just to get an idea of an upper bound.
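
If you want to grind through those projections yourself, here's a quick
throwaway helper (just a sketch of the rule of thumb above, nothing
that ships with NSP) that takes the window 2 time and memory from the
run above and a target window size n, and prints the N/2 to N range:

awk -v base_time=932.79 -v base_mem=1300 -v n=25 'BEGIN {
    # project a --window n run from the --window 2 baseline
    printf "time:   %.0f to %.0f seconds\n", base_time * n / 2, base_time * n;
    printf "memory: %.0f to %.0f MB\n", base_mem * n / 2, base_mem * n;
}'

With n=25 that prints roughly 11,660 to 23,320 seconds and 16,250 to
32,500 MB, which lines up with the estimates above.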

So, while the increase is linear, it becomes pretty expensive just
because we start off using a good bit of time and space. But then, if
you look at the total number of bigrams as the window size increases,
we are in fact doing a lot more counting, so these increases shouldn't
be altogether surprising.

Bottom line? Before running experiments with larger window sizes,
run NSP without windowing (--ngram 2) and then project out the time and
space requirements to make sure they're realistic for whatever hardware
you are using!
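
If it helps, this is the kind of dry run I have in mind (assuming GNU
time is available as /usr/bin/time - its -v output reports the elapsed
time and the maximum resident set size - and with baseline.out and
your-corpus.txt standing in for whatever names you actually use):

/usr/bin/time -v perl /usr/local/bin/count.pl --ngram 2 baseline.out your-corpus.txt

Then plug the elapsed time and memory from that run into the N/2 to N
projection above before committing to a large --window value.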

Enjoy,
Ted

On Thu, Oct 22, 2009 at 10:07 PM, Ted Pedersen <tpede...@d.umn.edu> wrote:
> Greetings all,
>
> I've been doing some stress testing of NSP lately, focusing in
> particular on figuring out the cost of "windowing", which allows us to
> count bigrams that allow some number of intervening words to appear
> between them (rather than just requiring them to be adjacent). This is
> provided by the --window option in count.pl
>
> Before I get into the results, here is some info about the hardware
> (so you can compare to your own setup):
>
> Dell Precision 670 with 12 GB of RAM and 2 dual processor Xeons, each
> with a 3.00 GhZ clock rate. Note that NSP only uses one core.
> uname -a output is : Linux marimba 2.6.24-23-server #1 SMP Wed Apr 1
> 22:14:30 UTC 2009 x86_64 GNU/Linux
>
> I used part of one directory of the English Gigaword data and created
> a single file of input where all the markup was removed and the text
> was converted to upper case. So, this is newspaper text essentially.
>
> I ran a series of commands of this form...
>
> perl -d:DProf count.pl --ngram 2 --window X windowX.out nsp-stress-test.txt
>
> ...meaning that we count all bigrams using a window size of X. Recall
> that the window size controls how far apart two words can be and still
> form a bigram. By default the window size is 2, meaning that the two
> words in a bigram must be adjacent. If the window size is 3, then there
> may be up to 1 intervening word between the two words in a bigram, and
> so forth. For example, with a window size of 3 the text THE BIG RED DOG
> produces the bigrams THE BIG, THE RED, BIG RED, BIG DOG, and RED DOG.
>
> Below is the script that I used to run these - note that the main
> variable is the window size - the point of this exercise really was
> to get a sense of how much windowing costs.
>
> =======================================
>
> #!/bin/bash -l
>
> for value in 2 3 5 10 25 50
>
> do
>        echo "window $value remove $value..."
>        perl -d:DProf /usr/local/bin/count.pl --window $value --ngram 2 \
>            --remove $value window$value.out nsp-stress-test.txt
>        dprofpp > window$value.prof.out
> done
>
> ==============================================================================
>
> In this case, nsp-stress-test.txt (11 MB) consists of the following
> (btw, this is a small file, but I wanted to get started somewhere...):
>
> 1,894,410 words of text (tokens)
> 44,919 unique words (types)
>
> Here's what we see in terms of time and space as window size increases...
>
> --------------------
>
> window size = 2
> Total Elapsed Time = 72.63075 Seconds
> Memory Utilization = 240 MB
> Total Bigrams = 1,894,409
> Unique Bigrams = 523,053
>
> window size = 3
> Total Elapsed Time = 121.7461 Seconds
> Memory Utilization = 490 MB
> Total Bigrams = 3,788,817
> Unique Bigrams = 1,129,153
>
> window size = 5
> Total Elapsed Time = 217.5907 Seconds
> Memory Utilization = 960 MB
> Total Bigrams = 7,577,630
> Unique Bigrams = 2,264,646
>
> window size = 10
> Total Elapsed Time = 450.1853 Seconds
> Memory Utilization = 1900 MB
> Total Bigrams = 17,049,645
> Unique Bigrams = 4,582,673
>
> window size = 25
> Total Elapsed Time = 1092.680 Seconds
> Memory Utilization = 3800 MB
> Total Bigrams = 45,465,540
> Unique Bigrams = 9,649,220
>
> window size = 50
> Total Elapsed Time = 2067.010 Seconds
> Memory Utilization = 6000 MB
> Total Bigrams = 92,824,865
> Unique Bigrams = 15,673,225
>
> A few summarizing stats...
>
> window size 50 takes 28.7 times as long to run as window size 2
> window size 50 takes 25 times as much memory as window size 2.
>
> window size 50 has 30 times as many unique bigrams as window size 2
> window size 50 has 48.9 times as many total bigrams as window size 2
>
> The good news is that the increase in time and memory use is linear with
> the window size. The bad news is that the memory footprint starts off
> fairly large so even a linear increase can lead to fairly high
> utilization.
>
> A rough rule of thumb then is that as window size increases, time and
> space requirements will expand by the amount of the window size
> increase. So going from window size of 2 to 25 will result in a 25x
> increase in time and memory.
>
> Now, this is all based on just a single file and a relatively small set of
> results, so my next step is to do this with a larger data file and see
> if the above rules of thumb continue to hold...
>
> More soon,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
