[jira] [Updated] (LUCENE-4845) Add AnalyzingInfixSuggester

Michael McCandless (JIRA) Mon, 18 Mar 2013 15:53:17 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-4845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-4845:
---------------------------------------

    Attachment: LUCENE-4845.patch

Another iteration:

  * I use SortingAtomicReader to sort all docs by weight (impact
    sorted postings), so during the search I can stop after
    collecting the first N docs: hits arrive in descending weight
    order, so those are the N highest-weighted matches.

  * I index leading ngrams up to a limit (default 4 characters) and
    use those instead of PrefixQuery when the last term is short.

  * I switched to a custom highlighter so prefix matches will always
    highlight correctly.
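
To illustrate the first two points, here is a minimal in-memory sketch
in plain Java. It is not the code in the patch and uses no Lucene APIs:
the class InfixSuggestSketch, its Suggestion type and the lookup method
are hypothetical names, whitespace tokenization stands in for a real
analyzer, and only single-term queries are handled. Docs are numbered
in descending weight order, leading ngrams of every token are indexed
up to 4 characters, and lookup stops as soon as N hits are collected.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy stand-in for the approach described above; not the patch itself. */
class InfixSuggestSketch {
    static final int MAX_NGRAM = 4; // mirrors the default limit mentioned above

    static final class Suggestion {
        final String text;
        final long weight;
        Suggestion(String text, long weight) { this.text = text; this.weight = weight; }
    }

    // Doc id == position in this list, which is sorted by descending weight.
    private final List<Suggestion> docs = new ArrayList<>();
    // Leading ngram -> doc ids in ascending order (i.e. descending weight).
    private final Map<String, List<Integer>> postings = new HashMap<>();

    InfixSuggestSketch(List<Suggestion> input) {
        docs.addAll(input);
        docs.sort(Comparator.comparingLong((Suggestion s) -> s.weight).reversed());
        for (int id = 0; id < docs.size(); id++) {
            for (String token : tokens(docs.get(id).text)) {
                int max = Math.min(MAX_NGRAM, token.length());
                for (int len = 1; len <= max; len++) {
                    add(token.substring(0, len), id);
                }
            }
        }
    }

    private void add(String key, int id) {
        List<Integer> list = postings.computeIfAbsent(key, k -> new ArrayList<>());
        // Ids arrive in increasing order, so checking the tail avoids duplicates.
        if (list.isEmpty() || list.get(list.size() - 1) != id) list.add(id);
    }

    private static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Returns up to n suggestions containing a token that starts with prefix. */
    List<String> lookup(String prefix, int n) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        if (p.isEmpty()) return out;
        String key = p.substring(0, Math.min(MAX_NGRAM, p.length()));
        for (int id : postings.getOrDefault(key, List.of())) {
            if (out.size() == n) break; // early termination: the rest weigh less
            String text = docs.get(id).text;
            // Short prefixes are answered by the ngram postings alone; longer
            // ones need a per-token check (a stand-in for a prefix query).
            if (p.length() <= MAX_NGRAM || anyTokenStartsWith(text, p)) out.add(text);
        }
        return out;
    }

    private static boolean anyTokenStartsWith(String text, String p) {
        for (String t : tokens(text)) {
            if (t.startsWith(p)) return true;
        }
        return false;
    }
}
```

With titles "Submarine Dreams" (weight 90) and "Yellow Submarine"
(weight 50), lookup("subm", 10) returns both, highest weight first,
even though the second title matches in the middle rather than at the
start.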

I tested on the FreeDB corpus (song titles), which is a pretty big
set of suggestions: 44.5 M songs across 3.2 M albums.  I picked a
random subset of the titles and then tested prefixes of length 2, 4,
6, 8 and 10:

  * 852 sec to build
  * 3.7 GB index
  * Prefix 2: 50656.1 lookups/sec
  * Prefix 4: 1361.0 lookups/sec
  * Prefix 6: 7291.0 lookups/sec
  * Prefix 8: 5364.5 lookups/sec
  * Prefix 10: 4144.0 lookups/sec

For comparison, here are the numbers for AnalyzingSuggester (which
doesn't highlight, so it's not quite a fair comparison):

  * 641 sec to build
  * 2.1 GB FST
  * Prefix 2: 9719.3 lookups/sec
  * Prefix 4: 15750.2 lookups/sec
  * Prefix 6: 21491.4 lookups/sec
  * Prefix 8: 27453.4 lookups/sec
  * Prefix 10: 33168.4 lookups/sec

So it's quite a bit slower than AnalyzingSuggester but I think it's
still plenty fast for most apps (this is perf for a single thread).

                
> Add AnalyzingInfixSuggester
> ---------------------------
>
>                 Key: LUCENE-4845
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4845
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.3
>
>         Attachments: infixSuggest.png, LUCENE-4845.patch, LUCENE-4845.patch
>
>
> Our current suggester impls do prefix matching of the incoming text
> against all compiled suggestions, but in some cases it's useful to
> allow infix matching.  E.g., Netflix does infix suggestions in their
> search box.
> I did a straightforward impl, just using a normal Lucene index, and
> using PostingsHighlighter to highlight matching tokens in the
> suggestions.
> I think this likely only works well when your suggestions have a
> strong prior ranking (weight input to build), eg Netflix knows
> the popularity of movies.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
