[
https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wettin updated LUCENE-626:
-------------------------------
Attachment: LUCENE-626_20070817.patch
Since the phrase-suggestion layer on top of contrib/spell in this patch has
been mentioned in a number of forums over the last few weeks, I've removed the
LUCENE-550 dependency and brought it up to date with the trunk.
Second-level suggesting (ngram token, phrase) can run standalone. See
TestTokenPhraseSuggester. However, I recommend the adaptive dictionary, as it
will act as a cache on top of second-level suggestions. (See docs.)
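To illustrate the caching idea, here is a minimal sketch in modern Java. It is not the patch's implementation: the class CachingSuggester and its methods are hypothetical names, and the real adaptive dictionary also learns from user sessions, which this sketch ignores. It only shows how a dictionary lookup can sit in front of an expensive second-level suggester.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: a dictionary that remembers answers from a slower second-level suggester. */
final class CachingSuggester {
    // query -> suggestion, filled as second-level suggestions come back
    private final Map<String, String> dictionary = new HashMap<>();
    // stand-in for the expensive suggester; may return null when it has no suggestion
    private final Function<String, String> secondLevel;
    private int secondLevelCalls = 0;

    CachingSuggester(Function<String, String> secondLevel) {
        this.secondLevel = secondLevel;
    }

    String didYouMean(String query) {
        String cached = dictionary.get(query);
        if (cached != null) {
            return cached; // cheap path: answered from the dictionary
        }
        secondLevelCalls++;
        String suggestion = secondLevel.apply(query);
        if (suggestion != null) {
            dictionary.put(query, suggestion); // remember it for next time
        }
        return suggestion;
    }

    int getSecondLevelCalls() {
        return secondLevelCalls;
    }
}
```

Asking for the same query twice only reaches the second-level suggester once. Queries with no suggestion are not cached in this sketch, so they are retried every time.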
Output from using the adaptive layer only, i.e. suggestions based on how users
previously behaved. About half a million user queries were analyzed to build the
dictionary (takes 30 seconds to build on my dual core):
3ms pirates ofthe caribbean -> pirates of the caribbean
2ms pirates of the carribbean -> pirates of the caribbean
0ms pirates carricean -> pirates caribbean
1ms pirates of the carriben -> pirates of the caribbean
0ms pirates of the carabien -> pirates of the caribbean
0ms pirates of the carabbean -> pirates of the caribbean
1ms pirates og carribean -> pirates of the caribbean
0ms pirates of the caribbean music -> pirates of the caribbean soundtrack
0ms pirates of the caribbean soundtrack -> pirates of the caribbean score
0ms pirate of carabian -> pirate of caribbean
0ms pirate of caribbean -> pirates of caribbean
0ms pirates of caribbean -> pirates of caribbean
0ms homm 4 -> homm iv
0ms the pilates -> null
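For anyone who wants to produce timing lines in the same shape as the listing above, a small harness along these lines would do. SuggestionTimer is a hypothetical helper, not part of the patch, and the suggester is passed in as a plain function so the sketch does not assume anything about the facade's API.

```java
import java.util.function.Function;

/** Hypothetical helper that times one suggestion and formats it like the listing above. */
final class SuggestionTimer {

    /** Formats a line such as "3ms query -> suggestion"; a null suggestion prints as "null". */
    static String formatLine(long millis, String query, String suggestion) {
        return millis + "ms " + query + " -> " + suggestion;
    }

    /** Runs the suggester once and reports the elapsed wall-clock time in whole milliseconds. */
    static String time(Function<String, String> suggester, String query) {
        long start = System.nanoTime();
        String suggestion = suggester.apply(query);
        long millis = (System.nanoTime() - start) / 1_000_000L;
        return formatLine(millis, query, suggestion);
    }
}
```

A single measurement like this is noisy; it is only meant to give a rough feel for the cost of one call.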
Output from the phrase ngram token suggester, which uses token matrices checked
against an a priori index. Many queries are required for one suggestion. Using
an instantiated index as the a priori index saves plenty of milliseconds. This
is expensive, but it works pretty well:
72ms the pilates -> the pirates
440ms heroes of fight and magic -> heroes of might and magic
417ms heroes of right and magic -> heroes of might and magic
383ms heroes of magic and light -> heroes of might and magic
20ms heroesof lightand magik -> null
385ms heroes of light and magik -> heroes of might and magic
0ms heroesof lightand magik -> heroes of might and magic
385ms heroes of magic and might -> heroes of might and magic
(That 0ms is because the previous suggestion was cached. One does not have to
use this cache.)
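A rough sketch of why the token-matrix approach needs so many lookups per suggestion: each query token gets a column of candidate tokens, every combination of one candidate per column is a candidate phrase, and each phrase is checked against the index. This is a simplified illustration under my own assumptions, not the code in the patch. TokenMatrixSketch, expand and bestPhrase are hypothetical names, and a plain map of phrase frequencies stands in for the a priori index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of phrase suggestion via a token matrix. */
final class TokenMatrixSketch {

    /** Enumerates every phrase obtainable by picking one candidate per column. */
    static List<String> expand(List<List<String>> matrix) {
        List<String> phrases = new ArrayList<>();
        phrases.add("");
        for (List<String> column : matrix) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String candidate : column) {
                    next.add(prefix.isEmpty() ? candidate : prefix + " " + candidate);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    /**
     * Returns the expanded phrase with the highest known frequency, or null
     * if none of the phrases is known. knownPhrases stands in for the index;
     * every getOrDefault call corresponds to one query against it.
     */
    static String bestPhrase(List<List<String>> matrix, Map<String, Integer> knownPhrases) {
        String best = null;
        int bestCount = 0;
        for (String phrase : expand(matrix)) {
            int count = knownPhrases.getOrDefault(phrase, 0);
            if (count > bestCount) {
                best = phrase;
                bestCount = count;
            }
        }
        return best;
    }
}
```

The number of phrases is the product of the column sizes, so a five-token query with a handful of candidates per token already means hundreds of lookups. That is the reason a fast index and a cache in front of it pay off.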
> Extended spell checker with phrase support and adaptive user session analysis.
> ------------------------------------------------------------------------------
>
> Key: LUCENE-626
> URL: https://issues.apache.org/jira/browse/LUCENE-626
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Karl Wettin
> Assignee: Karl Wettin
> Priority: Minor
> Attachments: didyoumean.patch.bz2, LUCENE-626_20070817.patch,
> spellchecker.diff
>
>
> Extensive javadocs are available in the patch, but I try to keep a compiled copy here:
> http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
> The patch spellchecker.diff should not depend on anything but Lucene trunk. It
> has basic support for phrase suggestions and query goal detection, but is
> pretty buggy and lacks features available in didyoumean.patch.bz2. The latter
> depends on LUCENE-550.
> Example:
> {code:java}
> public void testImportData() throws Exception {
> // load 200 000 user queries with session data and time stamp. no goals specified.
> System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
> importFile(new InputStreamReader(new GZIPInputStream(
>     new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
> System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
> importFile(new InputStreamReader(new GZIPInputStream(
>     new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
> System.out.println("Done.");
> // run some tests without the second level suggestions,
> // i.e. user behavioral data only. no ngrams or so.
>
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));
> assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));
> assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));
> assertEquals("pirates of the caribbean soundtrack",
>     facade.didYouMean("pirates of the caribbean music"));
> assertEquals("pirates of the caribbean score",
>     facade.didYouMean("pirates of the caribbean soundtrack"));
> assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
> assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
> assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));
> // depending on how many hits and goals are noted with these two queries
> // perhaps the delta should be added to a synonym dictionary?
> assertEquals("homm iv", facade.didYouMean("homm 4"));
> // not yet known.. and we have no second level yet.
> assertNull(facade.didYouMean("the pilates"));
> // use the dictionary built from user queries to build the token phrase
> // and ngram suggester.
>
> facade.getDictionary().getPrioritesBySecondLevelSuggester().put(
>     Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);
> // now it's learned
> assertEquals("the pirates", facade.didYouMean("the pilates"));
> // typos
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));
> // composite dictionary key not learned yet..
> assertNull(facade.didYouMean("heroesof lightand magik"));
> // learn
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
> // test
> assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));
> // wrong term order
> assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));
> }
> {code}