[ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wettin updated LUCENE-626:
-------------------------------

    Attachment: spellchecker.diff

It uses the ngram spell checker for queries not yet corrected by users, but it handles more than one word at a time, and it inspects the term position vector if available. This way it can also rearrange the input into the most probable order.

addDocument(indexWriter, field, "heroes of might and magic III complete");
addDocument(indexWriter, field, "it might be the best game ever made");

assertEquals("heroes of might and magic", suggester.didYouMean("hereos of magic and might"));
assertEquals("heroes of might and magic", suggester.didYouMean("hereos of light and magic"));
assertEquals("heroes might magic", suggester.didYouMean("magic light heros"));
assertEquals("best game made", suggester.didYouMean("game best made"));
assertEquals("game made", suggester.didYouMean("made game"));
assertEquals("game made", suggester.didYouMean("made lame"));
assertEquals(null, suggester.didYouMean("may game"));

Once someone clicks on a suggestion (you have to report this back to the suggester), it gets a higher priority. If the person reports interest in one or more of the results of the suggested query that followed, it gets an even higher priority. If something is suggested but not clicked on, its priority goes down. When the priority reaches a lower threshold, it is no longer suggested, and the next best suggestion appears instead. And so on.

Changing the query manually counts the same as clicking on a suggestion, provided the new query is similar enough and is entered within a certain time frame.
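The priority feedback described above could be sketched roughly as follows. This is only an illustration under stated assumptions: FeedbackRanker, its method names, and the boost and penalty constants are hypothetical and are not taken from spellchecker.diff.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical feedback-driven ranking; not the classes from spellchecker.diff. */
class FeedbackRanker {
    // Assumed tuning values; the patch's real constants are not stated in the comment.
    static final double CLICK_BOOST = 1.0;
    static final double RESULT_INTEREST_BOOST = 2.0;
    static final double IGNORE_PENALTY = 0.5;

    static final class Suggestion {
        final String text;
        double priority;

        Suggestion(String text, double priority) {
            this.text = text;
            this.priority = priority;
        }
    }

    // Same shape as the map mentioned in the comment: query -> candidate suggestions.
    private final Map<String, List<Suggestion>> suggestionsByQuery = new HashMap<>();
    private final double lowerThreshold;

    FeedbackRanker(double lowerThreshold) {
        this.lowerThreshold = lowerThreshold;
    }

    void addSuggestion(String query, String text, double initialPriority) {
        suggestionsByQuery.computeIfAbsent(query, k -> new ArrayList<>())
                .add(new Suggestion(text, initialPriority));
    }

    /** Returns the highest-priority suggestion still above the threshold, or null. */
    String didYouMean(String query) {
        Suggestion best = null;
        for (Suggestion s : candidates(query)) {
            if (s.priority > lowerThreshold && (best == null || s.priority > best.priority)) {
                best = s;
            }
        }
        return best == null ? null : best.text;
    }

    /** The user clicked the suggestion. */
    void reportClicked(String query, String text) {
        adjust(query, text, CLICK_BOOST);
    }

    /** The user showed interest in results of the suggested query. */
    void reportResultInterest(String query, String text) {
        adjust(query, text, RESULT_INTEREST_BOOST);
    }

    /** The suggestion was shown but not clicked. */
    void reportIgnored(String query, String text) {
        adjust(query, text, -IGNORE_PENALTY);
    }

    private List<Suggestion> candidates(String query) {
        List<Suggestion> found = suggestionsByQuery.get(query);
        return found == null ? Collections.<Suggestion>emptyList() : found;
    }

    private void adjust(String query, String text, double delta) {
        for (Suggestion s : candidates(query)) {
            if (s.text.equals(text)) {
                s.priority += delta;
            }
        }
    }

    public static void main(String[] args) {
        FeedbackRanker ranker = new FeedbackRanker(0.0);
        ranker.addSuggestion("heroes of night and magic", "heroes of might and magic", 1.0);
        ranker.addSuggestion("heroes of night and magic", "heroes of light and magic", 0.8);
        System.out.println(ranker.didYouMean("heroes of night and magic"));
    }
}
```

In this sketch a suggestion that is ignored often enough drops to the threshold and the next candidate takes over, which mirrors the behaviour described in the comment. The "similar enough and within a certain time frame" rule for manual corrections is left out.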
assertEquals("homm", suggester.didYouMean("heroes of might and magic"));
assertEquals("heroes of might and magic", suggester.didYouMean("heroes of night and magic"));
assertEquals("homm", suggester.didYouMean("heroes of might and magic"));
assertEquals("heroes of might and magic", suggester.didYouMean("homm"));

The data is stored in a Map<String /*query*/, List<Suggestion>>, and the default implementation strips \p{Punct} characters from the query. That should help with composite and decomposed words, among other things.

assertEquals("the da vinci code", suggester.didYouMean("thedavincicode"));
assertEquals("the da vinci code", suggester.didYouMean("the dav-inci code"));
assertEquals("heroes of might and magic", suggester.didYouMean("heroes ofnight andmagic"));

It seems the ngram spell checker tests are broken: they require the removed class English. I've re-introduced it in LUCENE-550.

I will not work further on this patch and issue. It will be added to LUCENE-550 for caching and such.

> Adaptive, user query session analyzing spell checker.
> -----------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: spellchecker.diff
>
>
> From javadocs:
> This is an adaptive, user query session analyzing spell checker. In plain
> words, a word and phrase dictionary that will learn from how users act while
> searching.
> Be aware, this is a beta version. It is not finished, but yields great
> results if you have enough user activity, RAM and a fairly narrow document
> corpus. The RAM problem can be fixed if you implement your own subclass of
> SpellChecker, as the abstract methods of this class are the CRUD methods. This
> will most probably change to a strategy class in a future version.
> TODO:
> 1. Gram up results to detect composite words that should not be composite
> words, and vice versa.
> 2. Train a grammed token (Markov) chain with output from an expectation
> maximization algorithm (Weka clusters?) parallel to a closest path (A* or
> breadth first?) to allow contextual suggestions on queries that were never
> placed.
> Usage:
> Training
> At user query time, create an instance of QueryResults containing the query
> string, number of hits and a time stamp. Add it to a chronologically ordered
> list in the user session (LinkedList makes sense) that you pass on to
> train(sessionQueries) as the session times out.
> You also want to call the bootstrap() method every 100000 queries or so.
> Spell checking
> Call getSuggestions(query) and look at the results. Don't modify it! This
> method call will be hidden in a facade in a future version.
> Note that the spell checker is case sensitive, so you want to clean up the
> query the same way when you train as when you request the suggestions.
> I recommend something like query = query.toLowerCase().replaceAll(" ", " ").trim()
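The clean-up and session bookkeeping recommended in the quoted usage notes might look like the sketch below. Two assumptions are made here. First, the archived replaceAll expression appears to have lost whitespace, so the sketch assumes the intent was to collapse runs of whitespace into one space. Second, the constructor of QueryResults is not shown in this message, so SessionLog.Entry is a hypothetical stand-in for it, and the call to train(sessionQueries) is not reproduced.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: normalizes queries identically for training and lookup. */
class QueryNormalizer {
    /** Lower-cases, collapses whitespace runs to a single space, and trims. */
    static String normalize(String query) {
        // Assumption: "\\s+" stands in for the archived replaceAll(" ", " ").
        return query.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }
}

/** Hypothetical per-session log; Entry stands in for the patch's QueryResults. */
class SessionLog {
    static final class Entry {
        final String query;
        final int hits;
        final long timestampMillis;

        Entry(String query, int hits, long timestampMillis) {
            this.query = query;
            this.hits = hits;
            this.timestampMillis = timestampMillis;
        }
    }

    // Chronologically ordered, as the usage notes suggest.
    private final List<Entry> entries = new LinkedList<>();

    /** Records one query in arrival order, normalized the same way as lookups. */
    void record(String rawQuery, int hits, long timestampMillis) {
        entries.add(new Entry(QueryNormalizer.normalize(rawQuery), hits, timestampMillis));
    }

    /** The list that would be handed to training when the session times out. */
    List<Entry> entries() {
        return entries;
    }

    public static void main(String[] args) {
        SessionLog log = new SessionLog();
        log.record("  Heroes   of  NIGHT and Magic ", 0, System.currentTimeMillis());
        log.record("heroes of might and magic", 12, System.currentTimeMillis());
        for (Entry e : log.entries()) {
            System.out.println(e.query + " (" + e.hits + " hits)");
        }
    }
}
```

Using one normalize method for both paths avoids the case-sensitivity mismatch the notes warn about.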