[ 
https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Attachment: spellchecker.diff

It uses the ngram spell checker for queries that users have not yet corrected, 
but it handles more than one word at a time, and it inspects the term position 
vector if available. This way it can also rearrange the input into the most 
probable order.

    addDocument(indexWriter, field, "heroes of might and magic III complete");
    addDocument(indexWriter, field, "it might be the best game ever made");

    assertEquals("heroes of might and magic", suggester.didYouMean("hereos of magic and might"));
    assertEquals("heroes of might and magic", suggester.didYouMean("hereos of light and magic"));
    assertEquals("heroes might magic", suggester.didYouMean("magic light heros"));
    assertEquals("best game made", suggester.didYouMean("game best made"));
    assertEquals("game made", suggester.didYouMean("made game"));
    assertEquals("game made", suggester.didYouMean("made lame"));
    assertEquals(null, suggester.didYouMean("may game"));

Once someone clicks on a suggestion (you have to report this back to the 
suggester), it gets a higher priority. If the person then reports interest in 
one or more of the results of the suggested query, it gets an even higher 
priority. If something is suggested but not clicked on, its priority goes 
down. When the priority reaches a lower threshold, it is no longer suggested, 
and the next best suggestion appears instead. And so on.
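
The feedback loop described above can be sketched roughly as follows. This is 
only an illustration and not the code in spellchecker.diff: the class 
RankedSuggestions, its method names and all numeric constants are assumptions 
made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of click feedback; not the classes from the patch. */
class RankedSuggestions {
    // Illustrative values only; the patch's real weights are not stated here.
    static final double CLICK_BOOST = 1.0;
    static final double RESULT_INTEREST_BOOST = 2.0;
    static final double IGNORED_PENALTY = 0.5;
    static final double LOWER_THRESHOLD = 0.0;

    static final class Entry {
        final String text;
        double priority;

        Entry(String text, double priority) {
            this.text = text;
            this.priority = priority;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    void add(String text, double priority) {
        entries.add(new Entry(text, priority));
    }

    /** Returns the highest-priority suggestion above the threshold, or null. */
    String best() {
        Entry top = null;
        for (Entry e : entries) {
            if (e.priority > LOWER_THRESHOLD
                    && (top == null || e.priority > top.priority)) {
                top = e;
            }
        }
        return top == null ? null : top.text;
    }

    /** The user clicked on the suggestion. */
    void clicked(String text) { adjust(text, CLICK_BOOST); }

    /** The user showed interest in a result of the suggested query. */
    void resultInspected(String text) { adjust(text, RESULT_INTEREST_BOOST); }

    /** The suggestion was shown but not clicked on. */
    void ignored(String text) { adjust(text, -IGNORED_PENALTY); }

    private void adjust(String text, double delta) {
        for (Entry e : entries) {
            if (e.text.equals(text)) {
                e.priority += delta;
            }
        }
    }
}
```

With these made-up numbers, a suggestion that starts at priority 1.0 and is 
ignored twice drops to the threshold and the next best suggestion takes over.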

Changing the query manually counts the same as clicking on a suggestion, 
provided the new query is similar enough and entered within a certain time 
frame.

    assertEquals("homm", suggester.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", suggester.didYouMean("heroes of night and magic"));

    assertEquals("homm", suggester.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", suggester.didYouMean("homm"));

The data is stored in a Map<String /*query*/, List<Suggestion>>, and the 
default implementation strips \p{Punct} characters from the query. That should 
help with compounding and decompounding, among other things.

    assertEquals("the da vinci code", suggester.didYouMean("thedavincicode"));
    assertEquals("the da vinci code", suggester.didYouMean("the dav-inci code"));
    assertEquals("heroes of might and magic", suggester.didYouMean("heroes ofnight andmagic"));
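
A minimal sketch of how a punctuation-stripped lookup key could make these 
variants meet in the same map entry. The class SuggestionIndex and its methods 
are hypothetical. Removing whitespace from the key is also an assumption of 
this sketch (the text above only mentions \p{Punct}), and the spelling 
correction in the last assertion would still need the ngram step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; the patch stores List<Suggestion>, not List<String>. */
class SuggestionIndex {
    private final Map<String, List<String>> byKey =
            new HashMap<String, List<String>>();

    /**
     * Builds the lookup key by removing \p{Punct} characters.
     * Whitespace is removed too; that part is an assumption of this sketch.
     */
    static String key(String query) {
        return query.replaceAll("[\\p{Punct}\\s]+", "");
    }

    /** Registers a suggestion under the normalized form of the query. */
    void put(String query, String suggestion) {
        String k = key(query);
        List<String> list = byKey.get(k);
        if (list == null) {
            list = new ArrayList<String>();
            byKey.put(k, list);
        }
        list.add(suggestion);
    }

    /** Returns the suggestions for the query, or an empty list. */
    List<String> get(String query) {
        List<String> found = byKey.get(key(query));
        return found == null ? new ArrayList<String>() : found;
    }
}
```

Here "thedavincicode", "the dav-inci code" and "the da vinci code" all map to 
the key "thedavincicode", so one stored suggestion serves all three.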

It seems the ngram spell checker tests are broken: they require the removed 
class English. I've re-introduced it in LUCENE-550.

I will not work further on this patch and issue. It will be added to LUCENE-550 
for caching and such.

> Adaptive, user query session analyzing spell checker.
> -----------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: spellchecker.diff
>
>
> From javadocs:
>  This is an adaptive, user query session analyzing spell checker. In plain 
> words, a word and phrase dictionary that will learn from how users act while 
> searching.
> Be aware, this is a beta version. It is not finished, but yields great 
> results if you have enough user activity, RAM and a fairly narrow document 
> corpus. The RAM problem can be fixed if you implement your own subclass of 
> SpellChecker, as the abstract methods of this class are the CRUD methods. This 
> will most probably change to a strategy class in a future version.
> TODO:
> 1. Gram up results to detect composite words that should not be composite 
> words, and vice versa.
> 2. Train a grammed token (Markov) chain with output from an expectation 
> maximization algorithm (Weka clusters?) parallel to a closest path (A* or 
> breadth first?) to allow contextual suggestions on queries that were never 
> placed.
> Usage:
> Training
> At user query time, create an instance of QueryResults containing the query 
> string, number of hits and a time stamp. Add it to a chronologically ordered 
> list in the user session (LinkedList makes sense) that you pass on to 
> train(sessionQueries) when the session times out.
> You also want to call the bootstrap() method every 100000 queries or so.
> Spell checking
> Call getSuggestions(query) and look at the results. Don't modify them! This 
> method call will be hidden in a facade in a future version.
> Note that the spell checker is case sensitive, so you want to clean up the 
> query the same way when you train as when you request the suggestions.
> I recommend something like
> query = query.toLowerCase().replaceAll("\\s+", " ").trim()
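
The training flow in the quoted description can be sketched roughly like this. 
It is an illustration only: QuerySession and its nested Record class are 
hypothetical stand-ins (Record plays the role of the patch's QueryResults, 
whose constructor is not shown in this message), and the whitespace regex in 
normalize() is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical per-user session buffer; not part of spellchecker.diff. */
class QuerySession {
    /** Stand-in for QueryResults: query string, hit count, time stamp. */
    static final class Record {
        final String query;
        final int hits;
        final long timestamp;

        Record(String query, int hits, long timestamp) {
            this.query = query;
            this.hits = hits;
            this.timestamp = timestamp;
        }
    }

    private final LinkedList<Record> queries = new LinkedList<Record>();

    /** Same clean-up for training and for suggestion requests. */
    static String normalize(String query) {
        return query.toLowerCase().replaceAll("\\s+", " ").trim();
    }

    /** Call at query time; calls are assumed to arrive in chronological order. */
    void record(String rawQuery, int hits, long timestampMillis) {
        queries.addLast(new Record(normalize(rawQuery), hits, timestampMillis));
    }

    /** Call when the session times out; returns the batch and clears the buffer. */
    List<Record> drain() {
        List<Record> batch = new ArrayList<Record>(queries);
        queries.clear();
        return batch;
    }
}
```

The list returned by drain() is what would be handed to the trainer when the 
session times out.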

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

