On Mon, 13 Mar 2006, Bryan Sant wrote:
> Here are the current results for the "count the dictionary words in a
> file" programs submitted thus far.
Would you run mine, please? To be fair, we should probably standardize on a
word list and test document. (And for my algorithm to perform well, we should
definitely use a larger input text file.)
My entry is here, including source code, an executable (since compiling it may
be problematic for many people), the word list, and kjv10, the KJV Bible from
Project Gutenberg (823,156 words):
http://lunkwill.org/src/dwords.tar.gz
[EMAIL PROTECTED] ~$ ll dwords.c
-rw-r--r-- 1 jason jason 63372299 Mar 13 17:05 dwords.c
I think I generated words.i with sort -f /usr/share/dict/words | uniq -i (the
file has to feed sort, not uniq, or the pipeline does nothing useful). It
includes only one copy of each word (ignoring case). My entry is also
case-insensitive.
The base .c file was generated with gperf, a really neat utility for
generating perfect (and often minimal) hash functions:
[EMAIL PROTECTED] ~$ time nice cat words.i |gperf -s 10 --output-file dwords.c \
--ignore-case -r
real 1010m33.269s
user 987m55.696s
sys 6m25.587s
(1010 minutes is 16.8 hours).
Nobody volunteered to compile with gcc, but I found that tcc compiles it using
around 2GB RAM. I'd still be interested to see gcc -O4 results.
Here's the runtime on my Athlon64 3000+:
[EMAIL PROTECTED] ~$ time ./a.out changelog >/dev/null
real 0m0.326s
user 0m0.270s
sys 0m0.056s
[EMAIL PROTECTED] ~$ time ./a.out kjv10 >/dev/null
real 0m0.829s
user 0m0.747s
sys 0m0.081s
To get an idea of how much of that is overhead for the 64MB executable, I made
a file 10x larger:
[EMAIL PROTECTED] ~$ cat kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 \
kjv10 >kjv10x10
[EMAIL PROTECTED] ~$ time ./a.out kjv10x10 >/dev/null
real 0m3.536s
user 0m3.375s
sys 0m0.119s
And then 100x larger:
[EMAIL PROTECTED] ~$ cat kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 \
kjv10x10 kjv10x10 kjv10x10 kjv10x10 >kjv10x100
[EMAIL PROTECTED] ~$ ll kjv10x100
-rw-r--r-- 1 jason jason 444526000 Mar 13 17:25 kjv10x100
[EMAIL PROTECTED] ~$ time ./a.out kjv10x100 >/dev/null
real 0m30.041s
user 0m29.617s
sys 0m0.339s
So the algorithm still wasn't hitting its stride with the 44MB file, but it
gets pretty close to linear runtime by the 440MB mark:
[EMAIL PROTECTED] ~$ cat kjv10x100 kjv10x100 >kjv10x200
[EMAIL PROTECTED] ~$ time ./a.out kjv10x200 >/dev/null
real 0m58.551s
user 0m57.645s
sys 0m0.592s
Actually, I'm not quite sure where the extra overhead is coming from, since the
runtime for small files is only about 0.3 seconds. I would have expected more
or less linear increases in runtime for any job taking more than a few
seconds; any ideas why it seems to keep getting faster per megabyte even for
very large files?
-J
/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/