On Tuesday 26 September 2006 16:44, Lyle Kopnicky wrote:
> Hi folks,
>
> I'm competing in a contest at work, and we're allowed to use whatever
> language we want. I decided this was my chance to prove to people that
> Haskell was up to the challenge. Unfortunately, I ran into performance
> problems. Since the contest ends this Friday, I've decided to switch to
> C++ (gasp!). But if any of you have advice on how to speed up this
> code, it could help me advocate Haskell in the future.
>
> It's supposed to match movie titles from an imported database to a
> reference database. The version I've sent doesn't do anything very
> smart - it's just doing literal title matches. The first argument to
> the program is the filename of the table to be imported, and the second
> is the filename of the reference table. The first line of each table is
> a pipe-separated list of field names; the rest of the lines are
> records, each a pipe-separated list of values.
>
> The import files each have 3,000 records, and the reference table has
> 137,986 records.
>
> Building the hash tables out of the files is quick - it just takes a
> few seconds. But doing the matching of title_id in one table to
> title_id in the other, in a nested loop between both tables, takes way
> too long. It's matching two import titles (against each of the
> reference titles) per second. It needs to do at least 20 per second to
> qualify for the contest, and it's not doing anything fancy yet.
Hmm... well, a doubly nested loop seems like the wrong approach. Also, if
you are using GHC, its hashtable implementation has fairly well-known
performance problems.

If all you care about is exact matching, then the operation is
essentially a finite map intersection (if I've understood what you are
trying to do). This is just a guess, but I suspect you will probably get
much better performance (and better-looking code!) by just using
Data.Map.intersection:

http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data-Map.html#v%3Aintersection

Alternately, there is the ternary trie implementation from Edison
(http://www.eecs.tufts.edu/~rdocki01) that may also work for you. If you
need to do prefix matching, then a trie is the way to go. You can
probably code up a nice prefix-intersection operation using tries that
should go pretty fast.

If you have some metric other than prefix in mind for partial matches,
then things probably get a lot more complicated. You're probably looking
at calculating minimum distances in some feature space, which calls for
pretty sophisticated algorithms if you need good performance.

> I tried various "improvements" to speed it up. One was to specifically
> use ByteString, eliminating the AbsString class. It didn't make a
> difference. Another was to use arrays instead of lists to store each
> record, and to precompute the indices of each of the fields within
> those records. I also iterated over a list of keys instead of the list
> of Maps, and only converted each record to a Map one at a time, hoping
> they would be disposed of sooner. Instead of speeding up the program,
> this slowed it down by a factor of 20!
>
> I've profiled it, and I can't make much out of that. It seemed to be
> spending 25% of its time doing scoring, and I thought the problem must
> be due to laziness, but I'm not sure.
>
> So if anyone has any ideas how to speed this up by a factor of at
> least 10, it would be really appreciated!
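To make the exact-match case concrete, here is a rough, completely
untested sketch of what I mean. I haven't seen your parsing code, so
the field names (`title_id`), the `indexBy`/`splitOn` helpers, and the
inline sample records are all invented for illustration; you would read
your two pipe-separated files instead:

```haskell
-- Sketch: exact-match join via Data.Map.intersectionWith.
import qualified Data.Map as M

-- Build a map from the key field (e.g. "title_id") to the whole record.
-- Assumes the first line is the pipe-separated header and that the
-- named field actually occurs in it (runtime error otherwise).
indexBy :: String -> [String] -> M.Map String [String]
indexBy _ [] = M.empty
indexBy keyField (header:rows) =
    M.fromList [ (fields !! keyIx, fields)
               | row <- rows
               , let fields = splitOn '|' row ]
  where
    Just keyIx = lookup keyField (zip (splitOn '|' header) [0..])

-- Minimal splitter on a single character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
    (chunk, [])     -> [chunk]
    (chunk, _:rest) -> chunk : splitOn c rest

main :: IO ()
main = do
    -- In the real program these would come from readFile + lines.
    let imported  = indexBy "title_id" sampleImport
        reference = indexBy "title_id" sampleRef
        -- Pair up records whose title_id occurs in both tables.
        matches   = M.intersectionWith (,) imported reference
    print (M.size matches)  -- prints 2
  where
    sampleImport = ["title_id|title", "1|Alien", "2|Brazil", "3|Casablanca"]
    sampleRef    = ["title_id|title", "2|Brazil", "3|Casablanca", "4|Dune"]
```

Building each map is O(n log n) and the intersection is roughly
O(n + m), so you do the join once instead of 3,000 x 137,986 record
comparisons.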
> Even the Ruby solutions are doing that, which is embarrassing.
>
> Thanks,
> Lyle

--
Rob Dockins

Talk softly and drive a Sherman tank.
Laugh hard, it's a long way to the bank.
          -- TMBG

_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
