Jonathan Ellis wrote:
#!/usr/bin/python2.4

import sys

WORDS_FNAME = '/usr/share/dict/words'
words = dict((line.rstrip(), 0) for line in file(WORDS_FNAME))

source = len(sys.argv) > 1 and file(sys.argv[1]) or sys.stdin
for line in source:
    for word in line.strip().split():
        if not word:
            continue
        try:
            words[word] += 1
        except KeyError:
            pass

for word, count in words.iteritems():
    if count > 0:
        print word, count

Here's a version of that code that runs in about 40% less time. Optimizations:

- Locals are faster than globals, so I moved the code into a function.

- Since we don't report words not used in the document, and a given document is likely to use only a small fraction of the dictionary, it's better to build a new dictionary of word frequency rather than filter the larger dictionary later.

- There are several possible ways to copy the words in a file to the keys of a dictionary; a plain loop that assigns each key ran faster than the other ways I tried.

- There's no need to strip() a line if all you're going to do with it is split() it.

- Exception handling is more expensive than conditions, so I switched to "if word in words".
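
The last point can be checked with a small timing sketch like the one below. Everything in it is made up for illustration: WORDS and TOKENS are synthetic stand-ins for the dictionary file and the document, and the function names are hypothetical. The relative cost depends on the interpreter version and on how often a lookup misses, so treat the numbers it prints as a rough guide only.

```python
import timeit

# Hypothetical stand-ins for the dictionary file and the input document.
WORDS = dict.fromkeys(["word%d" % i for i in range(1000)], 0)
# About a third of these tokens (word1000..word1499) are not in WORDS.
TOKENS = ["word%d" % (i % 1500) for i in range(20000)]

def count_with_exceptions(tokens=TOKENS):
    # Original approach: count into a copy of the full dictionary,
    # ignore misses via KeyError, then filter out the unused words.
    counts = dict.fromkeys(WORDS, 0)
    for word in tokens:
        try:
            counts[word] += 1
        except KeyError:
            pass
    return dict((w, c) for w, c in counts.items() if c > 0)

def count_with_membership(tokens=TOKENS):
    # Revised approach: test membership first and build a separate,
    # smaller dictionary that holds only the words actually seen.
    freq = {}
    for word in tokens:
        if word in WORDS:
            freq[word] = freq.get(word, 0) + 1
    return freq

if __name__ == "__main__":
    for fn in (count_with_exceptions, count_with_membership):
        print("%s: %.3fs" % (fn.__name__, timeit.timeit(fn, number=20)))
```

Both functions return the same counts, so only the running time differs.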

I thought of another tiny optimization that shaves off around 0.5% but reduces readability: pre-fetching the .get attribute of the 'freq' dictionary. I doubt it's worth the penalty.
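
For reference, pre-fetching the .get attribute might look like the sketch below. The function count_words and its parameters are names invented for illustration; the idea is only that the bound method is looked up once instead of on every iteration.

```python
def count_words(lines, words):
    """Count occurrences of known words (hypothetical helper for illustration)."""
    freq = {}
    freq_get = freq.get  # look up the bound method once, outside the loop
    for line in lines:
        for word in line.split():
            if word in words:
                freq[word] = freq_get(word, 0) + 1
    return freq
```

The saving comes from skipping one attribute lookup per word, which is why the gain is so small.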

Shane


#!/usr/bin/python2.4

import sys

def main():
    words_fname = '/usr/share/dict/words'
    if len(sys.argv) > 1:
        source = open(sys.argv[1])
    else:
        source = sys.stdin

    words = {}
    for line in open(words_fname):
        words[line.rstrip()] = 0

    freq = {}
    for line in source:
        for word in line.split():
            if word in words:
                freq[word] = freq.get(word, 0) + 1

    for word, count in freq.iteritems():
        print word, count

main()

