Hey, you're using String I/O! nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")
This should be WORDS <- fmap (train . B.words) (B.readFile "big.txt") By the way, which exact file do you use as a misspellings file? The corpus linked to at Norvig's page has many. And do you have a driver program that I could run and obtain your timings? 2009/6/22 Kamil Dworakowski <ka...@dworakowski.name>: > Hi all, > > I want to learn how to optimize Haskell code on the Norvig's spelling > correction program [1]. Peter Norvig has written the program in > Python, > quite a literate translation to Haskell (thanks to Grzegorz Chrupala) > [2] was also available. > > Both versions are about 23 LOCs, and Haskell version is four times > slower, 2m25s versus 36s. All the numbers I give come from running the > program on a list of 806 misspellings, Haskell version compiled with - > O2 > and -fforce-recomp flags. > > I started with trial and error approach. Just preventing a repetitive > call to keysSet brought the running time down to 1m48s. I also > restructured the edits1 function and dropped all uses of sets, which > haven't been pulling their own weight. This brought the running time > down to 55s. Not quite there yet but close. > > Reading the Profiling chapter in the RWH book enabled me to do some > more > informed optimizing. At this point the GC time was about 2% of the > overall runnig time, Grzegorz has done a good job of using strict > versions of functions where necessary. The timing per expression > however > showed something useful, 70% of the time was being spent around > determining membership in the known words dictionary (Data.Map). There > are about 30 000 items in the dictionary. > > Right... Python uses hashtables while here I have a tree with log n > access time. I did not want to use the Data.HashTable, it would > pervade > my program with IO. The alternative is an ideal hashmap that never > gets > changed. This program creates a dictionary at start which then is only > being used to read from: an ideal application for the Data.PerfectHash > by Mark Wotton available on Hackage [3]. > > The PerfectHash is a dictionary keyed by ByteStrings, so I had to > migrate the program to use these instead of Strings. Sadly it > increased > the running time to 59s. Adding PerfectHash brought the running time > down to 39s. I have done some code restructuring with the view to > publishing it, which brought the running time up to 45s. The code is > available on patch-tag [4] for anyone to scrutinize and hopefully > point > out mistakes (it uses big.txt and misspellings file which are linked > to > from Norvig's page [1]). > > At this point [5] I don't know how to proceed. The program is still > slower than Python. The -ddump-simpl dump does not give me any clues, > there is too much of it and I don't understand most of it. The GC time > is about 10% of the total, and according to the profiler almost half > the > time is spent in the member function, and almost all the rest in > edits1. > Shouldn't it run much faster than the interpreted Python? > > Frankly, I am not impressed with the PerfectHash performance; It looks > as if it is only two times faster than Data.Map. Is this because of > the > number of elements? > > 1. http://www.norvig.com/spell-correct.html > 2. > http://pitekus.blogspot.com/2007/04/norvigs-spelling-corrector-in-haskell.html > 3. http://hackage.haskell.org/package/PerfectHash > 4. http://patch-tag.com/r/spellcorrect/ > 5. > http://patch-tag.com/r/spellcorrect/snapshot/hash/20090621192727-a99e5-fed48a7ccf07b79c572fddb0304648fe80d39904/content/pretty/SpellingCorrection.hs > _______________________________________________ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe > -- Eugene Kirpichov Web IR developer, market.yandex.ru _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe