On 2014-01-02 18:30, Jaume Ortolà i Font wrote:
> 2014/1/2 Marcin Miłkowski <list-addr...@wp.pl <mailto:list-addr...@wp.pl>>
>
>     I did look at them again, and it turns out that most only slowed down
>     the process but did not help in creating good suggestions, as short
>     fragments were not put together the way they should be. So I went
>     ahead and removed a fair number of the replacement pairs, but we
>     still need to find ways to limit the search if we have too many
>     replacement pairs.
>
> That's easy, but the search will not be exhaustive...
>
>     I don't think traversing the automaton directly would be faster,
>     because we stop searching the automaton early anyway (as soon as the
>     letter is not found).
>
> I don't think so. Precisely when the algorithm is looking at the first
> letters ("Mi...") there are many words to search in the dictionary (with
> distance=1: ma..., me..., mi...; ai..., bi..., ci..., ... zi...). As you
> progress ("Milkow..."), there are fewer and fewer words. So every time we
> restart the search with a new candidate, I think we lose a lot of time.
> Traversing the automaton should be considerably faster.
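[Editor's note: the single-pass traversal Jaume describes is the classic trie-walk with a shared Levenshtein matrix, where each node extends its parent's matrix row instead of restarting the comparison per candidate. A minimal sketch, using a plain trie rather than LanguageTool's FSA (all class and method names here are illustrative, not the project's actual API):]

```java
import java.util.*;

// Sketch: find all dictionary words within maxDist edits of a query by
// walking the trie once. Each node computes one new row of the Levenshtein
// matrix from its parent's row; a subtree is pruned as soon as every cell
// in the row exceeds maxDist.
class TrieSpeller {
    static class Node {
        Map<Character, Node> children = new TreeMap<>();
        String word; // non-null iff a dictionary word ends at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.word = word;
    }

    List<String> suggest(String query, int maxDist) {
        List<String> results = new ArrayList<>();
        // First matrix row: distance from the empty prefix, i.e. 0..len.
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            search(e.getValue(), e.getKey(), query, firstRow, maxDist, results);
        }
        return results;
    }

    private void search(Node node, char ch, String query, int[] prevRow,
                        int maxDist, List<String> results) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int col = 1; col < cols; col++) {
            int insertCost = row[col - 1] + 1;
            int deleteCost = prevRow[col] + 1;
            int replaceCost = prevRow[col - 1]
                + (query.charAt(col - 1) == ch ? 0 : 1);
            row[col] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, row[col]);
        }
        if (node.word != null && row[cols - 1] <= maxDist) {
            results.add(node.word);
        }
        // Prune: if the whole row exceeds maxDist, no extension can recover.
        if (rowMin <= maxDist) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                search(e.getValue(), e.getKey(), query, row, maxDist, results);
            }
        }
    }
}
```

[This illustrates Jaume's point: the shared prefix "Mi..." is scored once for every word below it, and whole branches are cut off early, instead of computing a full distance per dictionary word.]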
Yes, you are right. I did some profiling, and indeed our main problem is
that findRepl on line 306 in Speller.java is run zillions of times.

One easy but brutal fix would be to run findRepl only on the original word
and use replacement pairs only to generate complete candidates. This would
save a lot of time, but at the cost of quality. Maybe it would do as a
stopgap until we have a proper traversal routine? The only way to find out
whether the quality of suggestions really drops is to run it on common
misspellings with and without findRepl on the wordsToCheck list. Could you
run this experiment on Catalan data?

> The difficulty in traversing the automaton appears with replacement pairs
> of different lengths (one to two, one to three, two to three...). The
> management of the matrix where the distances are stored becomes tricky.

I can see the problem, and this is why there was an upper limit on
replacement lengths in the original algorithm (in fsa_spell by Jan Daciuk).
This is actually quite tricky to solve.

> When I have some time, I could try again...
>
>     I'm afraid the truth is that we have to be very careful when adding
>     replacement pairs, as completely ignoring one-letter replacements is
>     not a good idea (at least not for Polish, in which people confuse "h"
>     and "ch").
>
> If the replacement h->ch can happen in any context or in most contexts,
> then the one-letter replacement is unavoidable.

Exactly.

Regards,
Marcin
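[Editor's note: the "easy but brutal" fix Marcin proposes, i.e. applying the replacement pairs only to the original misspelled word to build complete candidates and checking those against the dictionary, could be sketched as below. The class, method, and data-structure choices are hypothetical illustrations, not LanguageTool's actual code:]

```java
import java.util.*;

// Sketch: instead of consulting replacement pairs inside the distance
// search, apply each pair at every position of the original word to
// generate full candidate words, then keep only those that exist in the
// dictionary. Fast, but misses candidates needing a pair plus other edits.
class ReplacementCandidates {
    static List<String> candidates(String word, Map<String, String> pairs,
                                   Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> p : pairs.entrySet()) {
            String from = p.getKey();
            String to = p.getValue();
            int idx = word.indexOf(from);
            while (idx >= 0) {
                // Replace one occurrence of "from" with "to".
                String candidate = word.substring(0, idx) + to
                    + word.substring(idx + from.length());
                if (dictionary.contains(candidate)
                        && !result.contains(candidate)) {
                    result.add(candidate);
                }
                idx = word.indexOf(from, idx + 1);
            }
        }
        return result;
    }
}
```

[With the Polish h/ch confusion mentioned in the thread, a pair "h" -> "ch" would turn a misspelling like "hleb" into the dictionary word "chleb" directly, with no distance search at all; the quality question is how often real misspellings also need an additional edit, which this shortcut cannot find.]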
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel