[ https://issues.apache.org/jira/browse/LUCENE-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-4845:
---------------------------------------
Attachment: LUCENE-4845.patch
Another iteration:
* I use SortingAtomicReader to sort all docs by weight (impact-sorted
postings), and then during the search I stop after collecting the
first N docs.
* I index leading ngrams up to a limit (default 4 characters) and
use those instead of PrefixQuery when the last term is short.
* I switched to a custom highlighter so prefix matches will always
highlight correctly.
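The first two points can be illustrated with a toy, in-memory sketch.
This is not the patch's code and does not use Lucene: the class and
method names (ToyInfixSuggester, lookup) are hypothetical, a plain dict
stands in for the index, and a scan over the term dictionary stands in
for PrefixQuery. It assumes the last query token is always treated as a
prefix and every earlier token must match a whole word. Highlighting is
not covered.

```python
from collections import defaultdict

MAX_NGRAM = 4  # leading-ngram limit; 4 is the default mentioned above


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the analyzer.
    return text.lower().split()


class ToyInfixSuggester:
    def __init__(self, suggestions, max_ngram=MAX_NGRAM):
        # suggestions: iterable of (text, weight) pairs.
        # Sort once by descending weight so that doc ids are assigned in
        # impact order: a lower doc id always means a higher weight.
        self.docs = sorted(suggestions, key=lambda s: -s[1])
        self.max_ngram = max_ngram
        self.exact = defaultdict(list)   # whole token -> ascending doc ids
        self.ngrams = defaultdict(list)  # leading ngram -> ascending doc ids
        for doc_id, (text, _weight) in enumerate(self.docs):
            tokens = set(tokenize(text))
            for tok in tokens:
                self.exact[tok].append(doc_id)
            grams = {
                tok[:n]
                for tok in tokens
                for n in range(1, min(len(tok), max_ngram) + 1)
            }
            for gram in grams:
                self.ngrams[gram].append(doc_id)

    def _last_term_docs(self, prefix):
        if len(prefix) <= self.max_ngram:
            # Short last term: a single lookup in the ngram postings.
            return self.ngrams.get(prefix, [])
        # Longer last term: scan all terms, as a stand-in for PrefixQuery.
        ids = set()
        for tok, postings in self.exact.items():
            if tok.startswith(prefix):
                ids.update(postings)
        return sorted(ids)

    def lookup(self, query, n):
        terms = tokenize(query)
        if not terms:
            return []
        *complete, last = terms
        required = [set(self.exact.get(t, [])) for t in complete]
        results = []
        # Candidates arrive in ascending doc id order, i.e. best weight
        # first, so the first n matches are the top n and we can stop.
        for doc_id in self._last_term_docs(last):
            if all(doc_id in docs for docs in required):
                results.append(self.docs[doc_id])
                if len(results) == n:
                    break
        return results
```

Because doc ids are assigned in weight order, no scoring or priority
queue is needed at search time. The cost of that choice is that the
weights have to be known when the index is built.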
I tested on the FreeDB corpus (song titles) ... this is a pretty big
set of suggestions: 44.5 M songs across 3.2 M albums. I pick a random
subset of the titles, and then test prefixes of length 2, 4, 6, 8 and 10:
* 852 sec to build
* 3.7 GB index
* Prefix 2: 50656.1 lookups/sec
* Prefix 4: 1361.0 lookups/sec
* Prefix 6: 7291.0 lookups/sec
* Prefix 8: 5364.5 lookups/sec
* Prefix 10: 4144.0 lookups/sec
Compare with AnalyzingSuggester (which doesn't highlight, so it's not
quite a fair comparison):
* 641 sec to build
* 2.1 GB FST
* Prefix 2: 9719.3 lookups/sec
* Prefix 4: 15750.2 lookups/sec
* Prefix 6: 21491.4 lookups/sec
* Prefix 8: 27453.4 lookups/sec
* Prefix 10: 33168.4 lookups/sec
So for prefixes of 4 or more characters it's quite a bit slower than
AnalyzingSuggester, but I think it's still plenty fast for most apps
(this is perf for a single thread).
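For anyone wanting to reproduce this kind of measurement, here is a
rough single-threaded sketch of the procedure described above (random
subset of titles, fixed-length prefixes, lookups/sec). It is not the
benchmark code that produced the numbers in this comment. The function
name and parameters are hypothetical, and skipping titles shorter than
the prefix length is an assumption.

```python
import random
import time


def benchmark_prefix_lookups(lookup, titles, prefix_lengths, sample_size, seed=0):
    """Return {prefix_length: lookups_per_second} for one thread.

    lookup is any callable that takes a query string; it stands in for
    the suggester under test.
    """
    rng = random.Random(seed)  # fixed seed so runs use the same sample
    sample = rng.sample(titles, min(sample_size, len(titles)))
    rates = {}
    for length in prefix_lengths:
        # Assumption: titles shorter than the prefix length are skipped.
        queries = [t[:length] for t in sample if len(t) >= length]
        if not queries:
            continue
        start = time.perf_counter()
        for query in queries:
            lookup(query)
        # Guard against a zero reading on very coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates[length] = len(queries) / elapsed
    return rates
```

A real run would also want a warm-up pass before timing, since the
first lookups pay for cold caches.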
> Add AnalyzingInfixSuggester
> ---------------------------
>
> Key: LUCENE-4845
> URL: https://issues.apache.org/jira/browse/LUCENE-4845
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/spellchecker
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 5.0, 4.3
>
> Attachments: infixSuggest.png, LUCENE-4845.patch, LUCENE-4845.patch
>
>
> Our current suggester impls do prefix matching of the incoming text
> against all compiled suggestions, but in some cases it's useful to
> allow infix matching. E.g., Netflix does infix suggestions in their
> search box.
> I did a straightforward impl, just using a normal Lucene index, and
> using PostingsHighlighter to highlight matching tokens in the
> suggestions.
> I think this likely only works well when your suggestions have a
> strong prior ranking (weight input to build), e.g. Netflix knows
> the popularity of movies.