[ http://issues.apache.org/jira/browse/LUCENE-626?page=all ]

Karl Wettin updated LUCENE-626:
-------------------------------

    Attachment: spellcheck_20060804.tar.gz

beta 3

total rewrite with focus on adaptation.

session search sequence extraction, training and suggesting are now separate
classes passed to the spell checker.
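
for illustration, a minimal sketch of that kind of composition. all names below
are hypothetical and not taken from the attached patch:

import java.util.List;

// hypothetical stand-in for the QueryResults bean described in the javadocs:
// query string, number of hits and a time stamp.
class QueryResults {
    final String query;
    final int hits;
    final long timestamp;

    QueryResults(String query, int hits, long timestamp) {
        this.query = query;
        this.hits = hits;
        this.timestamp = timestamp;
    }
}

// hypothetical strategy interfaces, one per concern.
interface SessionSequenceExtractor {
    List<List<String>> extractSequences(List<QueryResults> sessionQueries);
}

interface Trainer {
    void train(List<List<String>> querySequences);
}

interface Suggester {
    List<String> suggest(String query);
}

// the spell checker itself only composes the three strategies.
public class ComposedSpellChecker {
    private final SessionSequenceExtractor extractor;
    private final Trainer trainer;
    private final Suggester suggester;

    public ComposedSpellChecker(SessionSequenceExtractor extractor,
                                Trainer trainer,
                                Suggester suggester) {
        this.extractor = extractor;
        this.trainer = trainer;
        this.suggester = suggester;
    }

    public void train(List<QueryResults> sessionQueries) {
        trainer.train(extractor.extractSequences(sessionQueries));
    }

    public List<String> getSuggestions(String query) {
        return suggester.suggest(query);
    }
}

swapping any of the three is then just a matter of passing another
implementation to the constructor.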

still requires lots of user interaction to build a sufficient dictionary.

has no optimization. bootstrap has been removed and will probably re-appear in a
future default suggestion scheme instead. should be fast enough.

now also comes with some junit test cases.

default implementations are quite simple but effective: they strip punctuation
and whitespace from the suggestive data (trained suggestion and test phrases) in
order to find incorrectly compounded and decomposed words, e.g. "the davinci
code" --> "the da vinci code", "a clock work orange" --> "a clockwork orange".


beta 4 will focus on training and suggestion classes that work on a secondary
trie populated with known good data extracted from the corpus, navigated with
edit distance (a plain edit distance sketch follows below). perhaps a
forest-type trie to allow any starting point in a phrase.

OR

beta 4 will focus on discriminating trained queries to build clusters and
suggest (facet) classes parallel to a plain text suggestion. that would be a
major RAM consumer and require lots of manual tweaking per implementation, but a
cool enough feature.

time will tell.
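
for the first alternative, the edit distance part would most likely be the usual
Levenshtein dynamic program. a plain (non-trie) reference version, not code from
the patch:

public final class EditDistance {

    // classic two-row Levenshtein. a trie walk would compute the same rows
    // incrementally per node and prune branches whose row minimum exceeds
    // the allowed distance.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}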

> Adaptive, user query session analyzing spell checker.
> -----------------------------------------------------
>
>                 Key: LUCENE-626
>                 URL: http://issues.apache.org/jira/browse/LUCENE-626
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: spellcheck_0.0.1.tar.gz, spellcheck_20060725.tar.gz, 
> spellcheck_20060804.tar.gz
>
>
> From javadocs:
>  This is an adaptive, user query session analyzing spell checker. In plain 
> words, a word and phrase dictionary that will learn from how users act while 
> searching.
> Be aware, this is a beta version. It is not finished, but yields great
> results if you have enough user activity, RAM and a fairly narrow document
> corpus. The RAM problem can be fixed if you implement your own subclass of
> SpellChecker, as the abstract methods of this class are the CRUD methods. This
> will most probably change to a strategy class in a future version.
> TODO:
> 1. Gram up results to detect composite words that should not be composite
> words, and vice versa.
> 2. Train a grammed token (Markov) chain with output from an expectation
> maximization algorithm (Weka clusters?) in parallel with a closest path (A* or
> breadth-first?) to allow contextual suggestions on queries that were never
> placed.
> Usage:
> Training
> At user query time, create an instance of QueryResults containing the query
> string, number of hits and a time stamp. Add it to a chronologically ordered
> list in the user session (LinkedList makes sense) that you pass on to
> train(sessionQueries) as the session times out.
> You also want to call the bootstrap() method every 100000 queries or so.
> Spell checking
> Call getSuggestions(query) and look at the results. Don't modify them! This
> method call will be hidden behind a facade in a future version.
> Note that the spell checker is case sensitive, so you want to clean up the
> query the same way when you train as when you request the suggestions.
> I recommend something like
> query = query.toLowerCase().replaceAll(" +", " ").trim()


        
