On Mon, 13 Mar 2006, Bryan Sant wrote:
> Here are the current results for the "count the dictionary words in a
> file" programs submitted thus far.
Would you run mine, please? To be fair, we should probably standardize on a
word list and test document. (And for my algorithm to perform well, we should
definitely use a larger input text file.)
My entry is here, including source code, an executable (since compiling it may
be problematic for many people), the word list, and kjv10, the KJV Bible from
Project Gutenberg (823,156 words):
http://lunkwill.org/src/dwords.tar.gz
[EMAIL PROTECTED] ~$ ll dwords.c
-rw-r--r-- 1 jason jason 63372299 Mar 13 17:05 dwords.c
I think I generated words.i with sort -f /usr/share/dict/words | uniq -i (the
file has to feed sort, not uniq, or the pipeline does nothing useful). It
includes only one copy of each word (ignoring case). My entry is also
case-insensitive.
The base .c file was generated with gperf, a really neat utility for
generating perfect (and often minimal) hash functions:
[EMAIL PROTECTED] ~$ time nice cat words.i |gperf -s 10 --output-file dwords.c \
--ignore-case -r
real 1010m33.269s
user 987m55.696s
sys 6m25.587s
(1010 minutes is 16.8 hours).
Nobody volunteered to compile with gcc, but I found that tcc compiles it using
around 2GB RAM. I'd still be interested to see gcc -O4 results.
Here's the runtime on my Athlon64 3000+:
[EMAIL PROTECTED] ~$ time ./a.out changelog >/dev/null
real 0m0.326s
user 0m0.270s
sys 0m0.056s
[EMAIL PROTECTED] ~$ time ./a.out kjv10 >/dev/null
real 0m0.829s
user 0m0.747s
sys 0m0.081s
To get an idea of how much of that is overhead for the 64MB executable, I made
a file 10x larger:
[EMAIL PROTECTED] ~$ cat kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 \
kjv10 >kjv10x10
[EMAIL PROTECTED] ~$ time ./a.out kjv10x10 >/dev/null
real 0m3.536s
user 0m3.375s
sys 0m0.119s
And then 100x larger:
[EMAIL PROTECTED] ~$ cat kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 \
kjv10x10 kjv10x10 kjv10x10 kjv10x10 >kjv10x100
[EMAIL PROTECTED] ~$ ll kjv10x100
-rw-r--r-- 1 jason jason 444526000 Mar 13 17:25 kjv10x100
[EMAIL PROTECTED] ~$ time ./a.out kjv10x100 >/dev/null
real 0m30.041s
user 0m29.617s
sys 0m0.339s
So the algorithm still wasn't hitting its stride with the 44MB file, but it
gets pretty close to linear runtime by the 440MB mark:
[EMAIL PROTECTED] ~$ cat kjv10x100 kjv10x100 >kjv10x200
[EMAIL PROTECTED] ~$ time ./a.out kjv10x200 >/dev/null
real 0m58.551s
user 0m57.645s
sys 0m0.592s
Actually, I'm not quite sure where the extra overhead is coming from, since the
runtime for small files is only about 0.3 seconds. I would have expected more
or less linear increases in runtime for any job taking more than a few
seconds; any ideas why it seems to keep getting faster per megabyte even for
very large files?
-J
/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/