[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

Robert Muir (JIRA) Thu, 12 Jul 2012 11:35:36 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413055#comment-13413055
 ]


Robert Muir commented on LUCENE-4100:
-------------------------------------

{quote}
Your index at 1) does not have to be 'optimized' (it does not have to consist 
of one index segment only). In fact, maxscore can be more efficient with 
multiple segments because multiple maxscores are computed for many frequent 
terms for subsets of documents, resulting in tighter bounds and more effective 
pruning.
{quote}

I've been thinking about this a lot lately: while what you say is true, that's 
because you reprocess all segments with IndexRewriter (which is fine for a 
static collection).

But in general this algorithm is not rank-safe with incremental indexing. The 
problem is that actual scores consist of per-segment, within-document stats 
(term frequency, document length), but are also affected by collection-wide 
statistics (IDF, average document length, ...) that depend on many other 
segments, or even on other machines in a distributed collection.

So I think for this to work and remain rank-safe, we cannot write the entire 
score into the segment, because the score at actual search time depends on 
all the other segments being searched. Instead, I think this can only work 
when we can easily factor out an impact (e.g. in the case of DefaultSimilarity 
the indexed maxscore excludes the IDF component, which is instead multiplied 
in at search time).
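
To make the factoring concrete, here is a minimal sketch in Java. It assumes a 
simplified DefaultSimilarity-style score of idf^2 * sqrt(tf) / sqrt(docLength) 
and ignores boosts, queryNorm, coord and norm quantization. The class and 
method names are hypothetical and are not part of Lucene or of the attached 
patch. Only the segment-local impact is stored at index time; the IDF factor 
is applied at search time, so the bound stays valid when other segments change 
the collection statistics.

```java
// Sketch only: a simplified TF-IDF model, not Lucene's actual Similarity API.
final class ImpactFactoringSketch {

    // Segment-local part of the score: depends only on the posting itself.
    static double impact(int freq, int docLength) {
        return Math.sqrt(freq) / Math.sqrt(docLength);
    }

    // Index time: largest impact of one term within one segment.
    static double maxImpact(int[] freqs, int[] docLengths) {
        double max = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            max = Math.max(max, impact(freqs[i], docLengths[i]));
        }
        return max;
    }

    // Search time: IDF from statistics summed over all segments being searched.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Full score of one posting under the current collection statistics.
    static double score(int freq, int docLength, double idf) {
        return idf * idf * impact(freq, docLength);
    }

    // Segment-wide upper bound: stored max impact times the search-time factor.
    static double upperBound(double storedMaxImpact, double idf) {
        return idf * idf * storedMaxImpact;
    }
}
```

Because idf^2 is a positive constant for a given term and search, multiplying 
by it preserves the ordering of impacts, so upperBound(...) is never below any 
score(...) in the segment regardless of how docFreq and numDocs evolve.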

I don't see how it can be rank-safe with algorithms like BM25 and incremental 
indexing, where parameters like average document length are not simple 
multiplicative factors in the formula but instead determine exactly how much 
tf versus document length contributes to the score. But I'll think about it 
some more.
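
The BM25 concern can be illustrated with a small Java sketch. It models only 
the tf/length part of BM25, tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)), 
with the commonly used k1 = 1.2 and b = 0.75 as assumptions, and leaves out 
IDF. All names are hypothetical. With two postings (tf=3, dl=50) and (tf=8, 
dl=400), the first scores higher when avgdl is 100, but the second scores 
higher once avgdl has grown to 1000, so a maxscore written at index time under 
one avgdl is neither a valid bound nor fixable by a constant factor later.

```java
// Sketch only: the tf/length part of BM25, not Lucene's BM25Similarity.
final class Bm25BoundSketch {
    static final double K1 = 1.2;  // assumed default
    static final double B = 0.75;  // assumed default

    // Score contribution of one posting for a given average document length.
    static double tfPart(int freq, int docLength, double avgDocLength) {
        double lengthNorm = 1.0 - B + B * docLength / avgDocLength;
        return freq * (K1 + 1.0) / (freq + K1 * lengthNorm);
    }

    // Index of the highest-scoring posting under the given avgDocLength.
    static int argMax(int[] freqs, int[] docLengths, double avgDocLength) {
        int best = 0;
        for (int i = 1; i < freqs.length; i++) {
            if (tfPart(freqs[i], docLengths[i], avgDocLength)
                    > tfPart(freqs[best], docLengths[best], avgDocLength)) {
                best = i;
            }
        }
        return best;
    }

    // Maxscore as it would be computed if avgDocLength were frozen at index time.
    static double maxScore(int[] freqs, int[] docLengths, double avgDocLength) {
        int best = argMax(freqs, docLengths, avgDocLength);
        return tfPart(freqs[best], docLengths[best], avgDocLength);
    }
}
```

Here maxScore(..., 100) is taken from the short document, yet at avgdl = 1000 
the long document exceeds that stored value. The two postings are also scaled 
by different ratios, which is why avgdl cannot be factored out the way IDF can 
in the DefaultSimilarity case.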

                
> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, patch, performance
>             Fix For: 4.0
>
>         Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get the discussion started on including the implementation 
> in Lucene's codebase. Because the technique requires awareness from the 
> Lucene user/developer, it seems best for it to become a contrib/module 
> package so that it can be chosen consciously.


        
