Karl, Thankyou for your in-depth reply. This has given me good grounds to go on.
Regards Glyn 2009/4/6 Karl Wettin <karl.wet...@gmail.com>: > 6 apr 2009 kl. 14.59 skrev Glyn Darkin: > > Hi Glyn, > >> to be able to spell check phrases >> >> E.g >> >> "Harry Poter" is converted to "Harry Potter" >> >> We have a fixed dataset so can build indexes/ dictionaries from our >> own data. > > the most obvious solution is index your contrib/spell checker with shingles. > This will however probably only help out with exact phrases. Perhaps that is > enough for you. > > If your example is a real one that you came up with by analyzing query logs > then you might want consider creating an index "stemmed" to handle various > problems associated with reading and writing disorders. Dyslectic people > often miss out on vowels, they who suffer from dysgraphia have problems with > q/p/d/b, other have problems with reoccuring characters, et c. A combination > of these problems could end up in a secondary "fuzzy" index that contains > weighted shingles like this for the document that points at "harry potter": > > "hary poter"^0.9 > "harry #otter"^0.8 > "hary #oter"^0.7 > "hrry pttr"^0.7 > "hry ptr"^0.5 > > In order to get a good precision/recall your query to such an index would > have to produce a boolean query containing all of the "stems" above if the > input was spelled correct. > > > One alternative to the contrib/spell checker is Spelt: > http://groups.google.com/group/spelt/ and I believe it is supposed to handle > phrases. > > > Note the difference between spell checking and suggestion schemes. Something > can be wrong even though the spelling is correct. Consider the game "Heroes > of might and magic", people might have fogotten what it was called and > search for "Heroes of light and magic" instead. Hopefully your query would > still yield a fairly good result for the correct document if the latter was > entered, but if you require all terms or something similar then it might > return no hits. > > > More advanced strategies for contextual spell checking of phrases usually > involve statistical models such as neural networs, hidden markov models, et > c. LingPipe contains such an implementation. > > > You can also take a look at reinforcement learning, learning from the > misstakes and corrections made by your users. It requires a lot of data > (user query logs) in order to work but will yeild very cool results such as > acronyms. > > LUCENE-626 is a multi layered spell checker with reinforcement learning in > the top, backed by an a priori corpus (that can be compiled from old user > queries) used to find context. It also use a refactored version of the > contrib/spell checker as second level suggestion when there is nothing to > pick up from previous user behaviour. I never deployed this in a real > system, it does however seem to work great when trained with a few hundred > thousand query sessions. > > > Finally I recommend that you take some time to analyze user query sessions > to find what the most common problems your users have and try to find a > solution that best fit those problems. Too often features are implemented > because they are listed in a specification and not because the users need > them. > > > I hope this helps. > > karl > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Glyn Darkin Darkin Systems Ltd Mob: 07961815649 Fax: 08717145065 Web: www.darkinsystems.com Company No: 6173001 VAT No: 906350835 --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org