David Sanders <[EMAIL PROTECTED]> writes:

> The data files are large (~100 million lines), and this code takes a
> long time to run (compared to just doing wc -l, for example).
wc is written in carefully optimized C and will almost certainly run
faster than any Python program.

> Am I doing something very inefficient?  (Any general comments on my
> pythonic (or otherwise) style are also appreciated!)  Is
> "line.split()" efficient, for example?

Your implementation's efficiency is not too bad. Stylistically it's not
quite fluent, but there's nothing to really criticize; you'll likely
develop a more concise style with experience. One small optimization you
could make is to use collections.defaultdict to hold the counters instead
of a regular dict, so you can drop the test for whether a key is already
in the dict (first sketch below).

Keep an eye on your program's memory consumption as it runs. The overhead
of a pair of Python ints plus the dictionary cell to hold them is some
dozens of bytes at minimum, so if you have a lot of distinct keys and not
enough memory to hold them all in one large dict, your system may be
thrashing. If that is happening, the two basic solutions are: 1) buy more
memory, or 2) divide the input into smaller pieces, count each piece
separately, and merge the results (second sketch below).

If I were writing this program and didn't have to run it too often, I'd
probably use the unix "sort" utility to sort the input (it does an
external disk sort if the input is too large to fit in memory), then make
a single pass over the sorted data to count up each group of keys (see
itertools.groupby for a convenient way to do that; third sketch below).
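First, a rough sketch of the defaultdict version. Since your code isn't
quoted here I'm guessing at the details: assume the filename is passed on
the command line and the key you're counting is the first
whitespace-separated field of each line.

    import sys
    from collections import defaultdict

    counts = defaultdict(int)           # missing keys start at 0
    with open(sys.argv[1]) as f:        # assumed: data file is the first argument
        for line in f:
            fields = line.split()
            if fields:                  # skip blank lines
                counts[fields[0]] += 1  # no "if key in counts" test needed

    for key, n in counts.items():
        print("%s %d" % (key, n))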
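Second, one way the divide-and-merge idea could look. This assumes the
big file has already been cut into chunks (with the unix "split" utility,
say), each small enough that its counts fit in memory; each chunk's
counts are written out as "key count" lines sorted by key, and the
per-chunk files are then merged as streams so the full set of distinct
keys never has to sit in memory at once. The file names are made up for
illustration, and keys are assumed to contain no whitespace.

    import heapq
    from collections import defaultdict
    from itertools import groupby

    def count_chunk(in_path, out_path):
        # Count keys in one manageable chunk; write "key count" lines sorted by key.
        counts = defaultdict(int)
        for line in open(in_path):
            fields = line.split()
            if fields:
                counts[fields[0]] += 1
        out = open(out_path, "w")
        for key in sorted(counts):
            out.write("%s %d\n" % (key, counts[key]))
        out.close()

    def merge_counts(paths):
        # Stream the sorted per-chunk files together and total up each key.
        streams = [open(p) for p in paths]
        merged = heapq.merge(*streams)   # lines arrive in key order
        for key, group in groupby(merged, key=lambda line: line.split()[0]):
            yield key, sum(int(line.split()[1]) for line in group)

    # e.g. after "split -l 10000000 huge_input.txt chunk_":
    #   count_chunk("chunk_aa", "counts_aa"); count_chunk("chunk_ab", "counts_ab"); ...
    #   for key, n in merge_counts(["counts_aa", "counts_ab"]):
    #       print("%s %d" % (key, n))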
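Third, the sort-then-groupby approach, assuming the data arrives on stdin
already sorted (e.g. "sort huge_input.txt | python count_keys.py") and
that the key is again the first whitespace-separated field.

    import sys
    from itertools import groupby

    def key_of(line):
        fields = line.split(None, 1)
        return fields[0] if fields else ""

    for key, group in groupby(sys.stdin, key=key_of):
        # group yields the consecutive run of lines sharing this key
        print("%s %d" % (key, sum(1 for _ in group)))

If sort itself is slow, running it with LC_ALL=C is often much faster; it
gives up locale-aware collation, which doesn't matter here since groupby
only needs equal keys to end up adjacent.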