On Mon, 13 Mar 2006, Bryan Sant wrote:

> Here are the current results for the "count the dictionary words in a
> file" programs submitted thus far.

Would you run mine, please? To be fair, we should probably standardize on a word list and test document. (And for my algorithm to perform well, we should definitely use a larger input text file.)

My entry is here, including source code, an executable (since compiling it may be problematic for many people), the word list, and kjv10, the King James Bible from Project Gutenberg (823,156 words):

http://lunkwill.org/src/dwords.tar.gz

[EMAIL PROTECTED] ~$ ll dwords.c
-rw-r--r--  1 jason jason 63372299 Mar 13 17:05 dwords.c

I think I generated words.i with sort -f /usr/share/dict/words | uniq -i. It includes only one copy of each word (ignoring case), and my entry is likewise case-insensitive.

The base .c file was generated with gperf, a really neat utility for generating perfect (and often minimal) hash functions:

[EMAIL PROTECTED] ~$ time nice cat words.i |gperf -s 10 --output-file dwords.c \
--ignore-case -r

real    1010m33.269s
user    987m55.696s
sys     6m25.587s

(1010 minutes is 16.8 hours).

Nobody volunteered to compile it with gcc, but I found that tcc compiles it using around 2GB of RAM. I'd still be interested to see gcc -O4 results.

Here's the runtime on my Athlon64 3000+:

[EMAIL PROTECTED] ~$ time ./a.out changelog >/dev/null

real    0m0.326s
user    0m0.270s
sys     0m0.056s

[EMAIL PROTECTED] ~$ time ./a.out kjv10  >/dev/null

real    0m0.829s
user    0m0.747s
sys     0m0.081s


To get an idea of how much of that is overhead for the 64MB executable, I made a file 10x larger:

[EMAIL PROTECTED] ~$ cat kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 \
kjv10 >kjv10x10

[EMAIL PROTECTED] ~$ time ./a.out kjv10x10  >/dev/null

real    0m3.536s
user    0m3.375s
sys     0m0.119s

And then 100x larger:

[EMAIL PROTECTED] ~$ cat kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 >kjv10x100

[EMAIL PROTECTED] ~$ ll kjv10x100
-rw-r--r--  1 jason jason 444526000 Mar 13 17:25 kjv10x100

[EMAIL PROTECTED] ~$ time ./a.out kjv10x100  >/dev/null

real    0m30.041s
user    0m29.617s
sys     0m0.339s


So, the algorithm still wasn't hitting its stride with the 44MB file, but it starts to get pretty close to linear runtime at the 440MB mark:

[EMAIL PROTECTED] ~$ cat kjv10x100 kjv10x100 >kjv10x200
[EMAIL PROTECTED] ~$ time ./a.out kjv10x200 >/dev/null

real    0m58.551s
user    0m57.645s
sys     0m0.592s

Actually, I'm not quite sure where the extra overhead is coming from: runtime for small files is only about 0.3 seconds, so I would have expected roughly linear increases for any job taking more than a few seconds. Any ideas why the per-byte cost keeps dropping even for very large files?

                                                -J

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
