[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413246#comment-13413246
 ] 

Stefan Pohl commented on LUCENE-4100:
-------------------------------------

Thanks for the feedback! You're spot on with everything you're saying.

Yes, the methods suggested in the different papers have (semi-)static 
indexes in mind, that is, setups that batch-index many new documents, then 
recompute maxscores (hence, IndexRewriter) and roll out the new version of the 
indexes. This is a Lucene use-case common to many large installations (or parts 
thereof) and as such important. Moreover, this approach can easily be 
generalized to the other Similarities, without them necessarily having to 
know about maxscore, and can be simplified by some minor API changes within 
Lucene. The PoC code as-is might help to showcase dependencies in 
general, including ones that currently are not well supported within Lucene 
(because there was no need for them yet).

If you really want to go the full distance: I have already thought about doing 
maxscore live and have some ideas in this regard; see below.

Comments to your thoughts:

[PostingsWriter]
You're right. For simplicity, I was computing each term's overall contribution 
(as explained in the talk), including all but query-dependent factors. You can 
consider this as un-quantized impacts (in the sense of Anh et al.), which 
necessitates a second pass over a static index, hence IndexRewriter.
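To make the idea concrete, here is a minimal sketch (hypothetical names, not the actual patch API) of such a second pass: for one term, compute a score contribution per posting with all query-dependent factors stripped, and keep the maximum. The TF-IDF-style formula mirrors DefaultSimilarity under assumed simplifications.

```java
// Sketch of an IndexRewriter-style second pass for ONE term (hypothetical
// names): compute an un-quantized impact per posting, keep the maximum.
// Query-dependent factors (boost, query norm) are deliberately left out.
class TermMaxScore {
    // TF-IDF-style contribution: sqrt(tf) * idf^2 * lengthNorm, i.e. a
    // DefaultSimilarity-like score with the query-side factors stripped.
    static float contribution(int tf, int docLen, float idf) {
        float lengthNorm = (float) (1.0 / Math.sqrt(docLen));
        return (float) Math.sqrt(tf) * idf * idf * lengthNorm;
    }

    // Max over all postings of the term: this single float is what the
    // rewriter would store per dictionary entry (or next to skip data).
    static float maxScore(int[] tfs, int[] docLens, float idf) {
        float max = 0f;
        for (int i = 0; i < tfs.length; i++) {
            max = Math.max(max, contribution(tfs[i], docLens[i], idf));
        }
        return max;
    }
}
```

Because the second pass sees the final (static) collection statistics, the stored maximum is exact rather than an estimate.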

As a side note: I noticed a drop in the PKLookup benchmark, suggesting that it 
might be better not to extend the size of dictionary entries, but to store 
maxscores at the beginning of inverted lists, or next to skip data. This effect 
should shrink or disappear, though, when maxscores are not stored for many 
terms.

[Length normalization]
Yes, this might be a necessary dependency. It should be a general 
design principle, though, to have as many statistics as possible at hand 
everywhere, as long as it doesn't hurt performance in terms of efficiency.

[splitting impacts / incremental indexing]
Yes, this would be more intrusive, requiring Similarity-dependent maxscore 
computations. Here is how it could work:
Very exotic scoring functions simply don't have to support maxscore and will 
thus fall back to the current score-all behaviour.
DefaultSimilarity is simple, but BM25 and LMDirichlet can't be factored out as 
easily, as you correctly point out. We could, however, come up with bounds for 
the collection statistics (those that go into the score) within which it is 
safe to use maxscore. Otherwise we fall back to score-all until a merge occurs, 
notify the user that a merge/optimize would help, or have Lucene do a 
segment rewrite with new maxscore and bound computations based on more current 
collection stats. I have first ideas for an algorithm to compute these bounds.
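For BM25 the bound idea can be sketched as follows (hypothetical names, standard BM25 formulation; not code from the patch): the per-posting score grows monotonically both in docCount (via idf) and in avgdl (the length norm shrinks), so a maxscore computed at upper bounds on both stays a valid upper bound while the live stats remain at or below those bounds.

```java
// Hypothetical sketch: when is a stored BM25 maxscore still a safe upper
// bound after the collection has grown? Names are assumptions, not Lucene API.
class Bm25Bounds {
    // BM25 score of one posting, standard formulation.
    static double bm25(int tf, double dl, double avgdl, long docCount, long df,
                       double k1, double b) {
        double idf = Math.log(1 + (docCount - df + 0.5) / (df + 0.5));
        double norm = k1 * (1 - b + b * dl / avgdl);
        return idf * tf * (k1 + 1) / (tf + norm);
    }

    // bm25 grows when avgdl grows (the length norm shrinks) and when
    // docCount grows (idf grows), so a maxscore computed at upper bounds on
    // both remains safe while the live stats stay at or below those bounds.
    static boolean maxscoreStillSafe(long liveDocCount, double liveAvgdl,
                                     long boundDocCount, double boundAvgdl) {
        return liveDocCount <= boundDocCount && liveAvgdl <= boundAvgdl;
    }
}
```

Once either statistic crosses its bound, the segment falls back to score-all until maxscores are recomputed.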

[docfreq=1 treatment]
Definitely agree. Possibly, terms with docfreq < x (x=10?) could simply not 
store a maxscore, with x configurable and its default determined by 
evaluation; x should be stored in the index so that it can be determined which 
terms don't carry maxscores.
Treating these terms specially (never considering them for exclusion 
within the algorithm) also allows for easier exchange of the core of the 
algorithm to get the WAND algorithm, or to ignore the maxscore of a term whose 
collection stats went out of bounds.
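A tiny sketch of that special treatment (names are hypothetical): low-docfreq terms, and terms whose collection stats drifted out of bounds, simply never become candidates for exclusion and are always scored.

```java
// Sketch of the docfreq cutoff: terms with df < x store no maxscore, and x
// itself would be written to the index so readers know which terms lack one.
// Hypothetical names; not code from the patch.
class MaxscoreCutoff {
    static final int DEFAULT_MIN_DF = 10; // "x"; default to be tuned empirically

    // A term may be excluded by the maxscore algorithm only if it carries a
    // usable maxscore: df above the cutoff AND collection stats in bounds.
    static boolean candidateForExclusion(int docFreq, int minDf,
                                         boolean statsInBounds) {
        return docFreq >= minDf && statsInBounds;
    }
}
```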

[maxscores per posting ranges]
+1. As indicated in the description, having multiple maxscores per term can be 
more efficient, possibly leading to tighter bounds and more skipping. 
Chakrabarti'11 opted for one extreme, computing a maxscore for each compressed 
posting block, whereas the optimal choice might be a multiple of blocks, 
or a posting range not aligned with the block size at all.
The optimal choice will depend heavily on the skip-list implementation and its 
parameters, but also on posting decompression overhead.
The question is how to get access to this codec-dependent information inside 
the scoring algorithm, tunneled through the TermQuery.
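The storage side of per-range maxscores is straightforward to sketch (hypothetical code, with an assumed range size; the real choice would be codec-dependent as discussed above):

```java
// Sketch: one maxscore per fixed-size posting range instead of one per term,
// a la Chakrabarti'11's per-block extreme. A range size that is a multiple of
// the compressed block size would align naturally with skip data.
class RangeMaxScores {
    // impacts: the per-posting score contributions of one term, in doc order.
    // Returns one maximum per range of rangeSize postings (last may be short).
    static float[] perRangeMax(float[] impacts, int rangeSize) {
        int nRanges = (impacts.length + rangeSize - 1) / rangeSize;
        float[] maxes = new float[nRanges];
        for (int i = 0; i < impacts.length; i++) {
            int r = i / rangeSize;
            maxes[r] = Math.max(maxes[r], impacts[i]);
        }
        return maxes;
    }
}
```

During scoring, a range whose maximum cannot beat the current threshold can be skipped wholesale via the skip list.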

[store <4 bytes per maxscore]
Possible. As long as the next higher representable number is stored (ceil, 
not floor), no docs will be missed and the algorithm remains correct. But 
because of the looser bounds, the efficiency gain will suffer at some point 
when too few bits are used.

If the score is factored out anyway, it might be better to simply store the 
document-dependent stats (TF, doclen) of the document with the maximum score 
contribution (as ints), instead of one aggregated intermediate float score 
contribution.

[implementation inside codec]
Please be aware that while terms are at some point excluded from merging, they 
are still advanced to the docs found in the other lists, to gain complete 
document knowledge and compute exact scores. Maxscores can also be used to 
minimize how often this happens, but the gains are often compensated by the 
more complex scoring. Still having to skip inside excluded terms complicates 
your suggested implementation. But we should definitely consider architecture 
alternatives. The MaxscoreCollector, for instance, currently only serves as a 
user-facing API; keeping track of the top-k and their entry threshold could 
well be done inside the Maxscorer.
I was thinking, though, of extending the MaxscoreCollector to provide 
different scoring information, e.g. an approximation of the total number of 
hits next to the actual number of scored documents (currently totalHits).
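For readers less familiar with the algorithm the discussion refers to, here is a sketch of its core partition step (Turtle & Flood 1995), with hypothetical names: with terms sorted by ascending maxscore, the longest prefix whose cumulative maxscores cannot beat the current top-k entry threshold is "non-essential" and need not drive iteration; as noted above, such terms are still advanced to candidate docs to compute exact scores.

```java
// Sketch of the core maxscore partition. sortedMaxscores: per-term maxscores
// in ascending order; threshold: score of the current k-th best document.
// A document containing ONLY terms from the excluded prefix cannot exceed
// the sum of their maxscores, so it can never enter the top-k.
class MaxscorePartition {
    // Returns how many leading (lowest-maxscore) terms can be excluded:
    // their summed maxscores alone cannot beat the threshold.
    static int excludableCount(float[] sortedMaxscores, float threshold) {
        float cumulative = 0f;
        int excluded = 0;
        for (float m : sortedMaxscores) {
            cumulative += m;
            if (cumulative > threshold) break;
            excluded++;
        }
        return excluded;
    }
}
```

As the threshold rises during collection, more terms become excludable, which is exactly why moving the threshold tracking into the Maxscorer (rather than the collector) is attractive.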

                
> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, patch, performance
>             Fix For: 4.0
>
>         Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
