[ http://issues.apache.org/jira/browse/LUCENE-626?page=all ]
Karl Wettin updated LUCENE-626:
-------------------------------
Attachment: spellcheck_20060804.tar.gz
beta 3
total rewrite with focus on adaptation.
session search sequence extraction, training and suggesting are now separate
classes passed to the spell checker.
still requires lots of user interaction to build a sufficient dictionary.
has no optimization. bootstrap has been removed and will probably re-appear in
a future default suggestion scheme instead. should be fast enough.
now also comes with some junit test cases.
default implementations are quite simple, but effective: strips punctuation and
whitespace from suggestive data (trained suggestion and test phrases) in order
to find incorrect composite and decomposed words. e.g. "the davinci code" -->
"the da vinci code", "a clock work orange" --> "a clockwork orange".
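The strip-and-compare idea above can be sketched as follows. This is a minimal illustration of the technique, not the attached implementation; the class and method names are hypothetical.

```java
// Hypothetical sketch of the default normalization idea: strip punctuation
// and whitespace so that composite and decomposed spellings of the same
// phrase collapse to a single key. PhraseKey and normalize() are
// illustrative names, not classes from the attached patch.
public final class PhraseKey {

    /** "the davinci code" and "the da vinci code" both map to "thedavincicode". */
    public static String normalize(String phrase) {
        return phrase.toLowerCase().replaceAll("[\\p{Punct}\\s]+", "");
    }

    public static void main(String[] args) {
        System.out.println(
            normalize("the davinci code").equals(normalize("the da vinci code")));
    }
}
```

Two phrases that collapse to the same key are candidates for a composite/decomposed-word suggestion.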
beta 4 will focus on training and suggestion classes that work on a secondary
trie populated with known good data extracted from the corpus, navigated with
edit distance. perhaps a forest-type trie to allow any starting point in a phrase.
OR
beta 4 will focus on discriminating trained queries to build clusters and
suggest (facet) classes parallel to a plain text suggestion. that would be a
major ram-consumer and require lots of manual tweaking per implementation, but a
cool enough feature.
time will tell.
> Adaptive, user query session analyzing spell checker.
> -----------------------------------------------------
>
> Key: LUCENE-626
> URL: http://issues.apache.org/jira/browse/LUCENE-626
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Karl Wettin
> Priority: Minor
> Attachments: spellcheck_0.0.1.tar.gz, spellcheck_20060725.tar.gz,
> spellcheck_20060804.tar.gz
>
>
> From javadocs:
> This is an adaptive, user query session analyzing spell checker. In plain
> words, a word and phrase dictionary that will learn from how users act while
> searching.
> Be aware, this is a beta version. It is not finished, but yields great
> results if you have enough user activity, RAM and a fairly narrow document
> corpus. The RAM problem can be fixed if you implement your own subclass of
> SpellChecker, as the abstract methods of this class are the CRUD methods. This
> will most probably change to a strategy class in a future version.
> TODO:
> 1. Gram up results to detect composite words that should not be composite
> words, and vice versa.
> 2. Train a grammed token (Markov) chain with output from an expectation
> maximization algorithm (Weka clusters?) in parallel with a closest path (A* or
> breadth-first?) to allow contextual suggestions on queries that were never
> placed.
> Usage:
> Training
> At user query time, create an instance of QueryResults containing the query
> string, number of hits and a time stamp. Add it to a chronologically ordered
> list in the user session (LinkedList makes sense) that you pass on to
> train(sessionQueries) as the session times out.
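The training flow described above might look like this. QueryResults here is a local stand-in for the class in the attached patch; the field set (query, hit count, timestamp) comes from the prose, but the real constructor signature and the train(...) call are assumptions, not the verified API.

```java
import java.util.LinkedList;

// Hedged sketch of session collection for training. QueryResults is a
// stand-in; the actual class ships in the attached spellcheck tarball.
public class TrainingSketch {

    // Stand-in for the patch's QueryResults class (assumed fields).
    static class QueryResults {
        final String query;
        final int hits;
        final long timestamp;
        QueryResults(String query, int hits, long timestamp) {
            this.query = query;
            this.hits = hits;
            this.timestamp = timestamp;
        }
    }

    // One chronologically ordered list of queries per user session.
    static LinkedList<QueryResults> session = new LinkedList<>();

    static void onUserQuery(String query, int numberOfHits) {
        session.add(new QueryResults(query, numberOfHits, System.currentTimeMillis()));
    }

    static void onSessionTimeout() {
        // Hand the whole sequence to the trainer from the attached patch:
        // spellChecker.train(session);
        session = new LinkedList<>();
    }

    public static void main(String[] args) {
        onUserQuery("the davinci code", 0);    // misspelled, no hits
        onUserQuery("the da vinci code", 120); // corrected by the user
        onSessionTimeout();
    }
}
```

The point of the chronological list is that a zero-hit query followed by a successful rephrasing in the same session is exactly the signal the trainer learns from.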
> You also want to call the bootstrap() method every 100000 queries or so.
> Spell checking
> Call getSuggestions(query) and look at the results. Don't modify them! This
> method call will be hidden in a facade in a future version.
> Note that the spell checker is case sensitive, so you want to clean up the
> query the same way when you train as when you request the suggestions.
> I recommend something like query = query.toLowerCase().replaceAll("\\s+", " ").trim()
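As a self-contained sketch of that clean-up step (assuming the intent of the snippet is lowercasing and collapsing runs of whitespace to a single space):

```java
// Query normalization applied identically at training time and at
// suggestion time, so the case-sensitive spell checker sees the same
// form both ways. The "\\s+" regex is the assumed intent of the
// javadoc snippet: collapse whitespace runs to one space.
public class Normalize {

    static String clean(String query) {
        return query.toLowerCase().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("  The   DaVinci  Code "));
    }
}
```

Applying any other clean-up (stemming, punctuation stripping) must likewise happen on both paths, or trained keys and lookup keys will diverge.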
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]