[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

Robert Muir (JIRA) Thu, 12 Jul 2012 16:49:37 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413335#comment-13413335
 ]


Robert Muir commented on LUCENE-4100:
-------------------------------------

{quote}
As a side note: I noticed a drop in the PKLookup benchmark, suggesting that it 
might be better not to extend the size of dictionary items, but to store 
maxscores in the beginning of inverted lists, or next to skip data. This effect 
should be smaller or disappear though when maxscores are not stored for many 
terms.
{quote}

I wouldn't worry about this, I noticed a few things that might speed that up:

1. Currently it does writeVInt(Float.floatToIntBits(term.maxscoreScore)) . But 
I think this should be writeInt, not writeVInt?
So I think currently we often write 5 bytes here, with all the vint checks for 
each byte, and as an Int it would always be 4 and faster.
2. Yes, with low freq terms (e.g. docFreq < skipMinimum), its probably best to 
just omit this at both read and write time. Then PK lookup would be fine.
3. As far as < 4 bytes ceiling, my motivation there was not to save in the term 
dictionary, but instead to make these smaller and allow us to add these at 
regular intervals. We can take advantage of a few things, e.g. it should never 
be a negative number for a 
well-formed Similarity (i think that would screw up the algorithm looking at 
your tests anyway).

{quote}
DefaultSimilarity is simple, but BM25 and LMDirichlet can't as easily be 
factored out, as you correctly point out, but we could come up with bounds for 
collection statistics (those that go into the score) within which it is safe to 
use maxscore, otherwise we fallback to score-all until a merge occurs, or we 
notify the user to better do a merge/optimize, or Lucene does a segment-rewrite 
with new maxscore and bound computations on basis of more current collection 
stats. I got first ideas for an algorithm to compute these bounds.
{quote}

Ok, I'm not sure I totally see how the bounds computation can work, but if it 
can we might be ok in general. If the different segments are somewhat 
homogeneous then these stats should pretty much be very close anyway.

The other idea i had was more intrusive, adding a computeImpact() etc to 
Similarity or whatever.

{quote}
If the score is anyway factored out, it might be better to simply store all 
document-dependent stats (TF, doclen) of the document with the maximum score 
contribution (as ints) instead of one aggregate intermediate float score 
contribution.
{quote}

That might be a good idea. with TF as a vint and doclen as a byte, we would 
typically only have two bytes but not actually lose any information (by 
default, all these sims encode doclen as a byte anyway).

{quote}
[implementation inside codec]
Please be aware that while terms are at some point excluded from merging, they 
still are advanced to the docs in other lists to gain complete document 
knowledge and compute exact scores. Maxscores can also be used to minimize how 
often this happens, but the gains are often compensated by the more complex 
scoring. Still having to skip inside of excluded terms complicates your 
suggested implementation. But we definitely should consider architecture 
alternatives. The MaxscoreCollector, for instance, does currently only have a 
user interface function, keeping track of the top-k and their entry threshold 
could well be done inside the Maxscorer.
I was thinking though to extend the MaxscoreCollector to provide different 
scoring information, e.g. an approximation of the number of hits next to the 
actual number of scored documents (currently totalHits).
{quote}

My current line of thinking is even crazier, but I don't yet have anything 
close to a plan.

As a start, I do think that IndexSearcher.search() methods should take a "Score 
Mode" of sorts from the user (some enum), which would allow Lucene to do less 
work if its not necessary. We would pass this down via Weight.scorer() as a 
parameter... solely
looking at the search side I think this would open up opportunities in general 
for us to optimize things: e.g. instantiate the appropriate Collector impl, and 
for Weights to create the most optimal Scorers. Not yet sure how it would tie 
into the code API.

I started hacking up on a prototype that looks like this (I might have tried to 
refactor too hard also shoving the Sort options in here...)
{noformat}
/**
 * Different modes of search.
 */
public enum ScoreMode {
  /** 
   * No guarantees that the ranking is correct,
   * the results may come back in a different order than if all
   * documents were actually scored. Total hit count may be 
   * unavailable or approximate.
   */
  APPROXIMATE,
  /** 
   * Ranking is the same as {@link COMPLETE}, but total hit 
   * count may be unavailable or approximate.
   */
  SAFE,
  /**
   * Guarantees complete iteration over all documents, but scores
   * may be unavailable.
   */
  COMPLETE_NO_SCORES,
  /**
   * Guarantees complete iteration over all documents, and scores
   * will be computed, but the maximum score may be unavailable 
   */
  COMPLETE_NO_MAX_SCORE,
  /**
   * Guarantees complete iteration and scoring of all documents.
   */
  COMPLETE,
};
{noformat}
                
> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, patch, performance
>             Fix For: 4.0
>
>         Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

Reply via email to