[ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213939#comment-16213939 ]

Robert Muir commented on LUCENE-4100:
-------------------------------------

{quote}
Store more metadata in order to be able to compute better upper bounds for the 
scores, like the maximum term frequency. Maybe even allow similarity-specific 
stats?
{quote}

Maximum term frequency may be a reasonable idea. It's imperfect but it's 
simple: it can be re-computed on merge, validated by CheckIndex, etc. The 
downside is that per-term stats are still fairly costly, I think, for 
blocktree, so would the performance benefits be worth it? I do like that it'd 
require no additional storage for an {{omitTF}} field, since the frequency is 
1 there by definition.

For phrases it could still work too (just take the min or max across all the 
terms in the phrase and pass it to the sim as a rough approximation). So 
instead of:
{code}
/** Return the maximum score that this scorer may produce.
  * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not bounded. */
public abstract float maxScore();
{code}
we'd have
{code}
/** Return the maximum score that this scorer may produce.
  * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not bounded.
  * @param maxFreq maximum possible frequency value
  */
public abstract float maxScore(float maxFreq);
{code}
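As an illustration of how a similarity could consume such a bound, here is a minimal sketch assuming BM25's standard saturation formula; {{Bm25Bound}}, {{idf}} and {{K1}} are hypothetical names for this example, not Lucene's actual API. For simplicity it ignores length normalization (b = 0), under which the tf component saturates at k1 + 1:

```java
// Hypothetical sketch, not Lucene code: an upper bound on a BM25-style
// score given a per-term maximum frequency.
public class Bm25Bound {
    private static final float K1 = 1.2f;
    private final float idf;

    public Bm25Bound(float idf) {
        this.idf = idf;
    }

    // With b = 0, BM25's tf component is freq*(k1+1)/(freq + k1), which is
    // monotonically increasing in freq, so evaluating it at maxFreq gives a
    // valid upper bound on the score of any document for this term.
    public float maxScore(float maxFreq) {
        return idf * (maxFreq * (K1 + 1)) / (maxFreq + K1);
    }

    public static void main(String[] args) {
        Bm25Bound bound = new Bm25Bound(2.0f);
        // The bound tightens as maxFreq shrinks: a term occurring at most
        // once per doc never approaches the saturation limit (k1+1)*idf.
        System.out.println(bound.maxScore(1f));
        System.out.println(bound.maxScore(1000f));
    }
}
```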

I wonder how much it would improve your performance for BM25, though? I think 
we shouldn't add such a stat unless it really helps our defaults, given the 
costly nature of such statistics (not just cpu/storage, but api complexity, 
codec requirements, etc). As an aside, by my math (just playing around with 
numbers), it's still not enough to make the optimization potent for 
ClassicSimilarity, which may be the worst case here :) 
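To see why ClassicSimilarity would be the hard case, a toy comparison (illustrative numbers only, not Lucene code): BM25's tf component saturates near k1 + 1, so even a loose maxFreq yields a tight bound, while ClassicSimilarity's sqrt(freq) is unbounded, so its bound grows with maxFreq itself:

```java
// Illustrative only: compare how the tf components of BM25 and
// ClassicSimilarity grow with frequency.
public class TfGrowth {
    // BM25 tf component (length normalization omitted): saturates at k1+1.
    static float bm25Tf(float freq, float k1) {
        return (freq * (k1 + 1)) / (freq + k1);
    }

    // ClassicSimilarity tf component: sqrt(freq), grows without limit.
    static float classicTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        for (float freq : new float[] {1f, 100f, 10000f}) {
            System.out.println("freq=" + freq
                + " bm25=" + bm25Tf(freq, 1.2f)
                + " classic=" + classicTf(freq));
        }
    }
}
```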

Theoretically it'd be simple to independently expose the minimum value of the 
field's norm to do even better (the similarity could pull that itself from its 
LeafReader), but it'd be fragile: messy data, such as a single very short doc 
for the field, will effectively negate that optimization for the entire field. 
Alternatively, storing the minimum norm value per term might do it; once 
written it does not change, so it can be easily merged, cross-verified with 
norm values in CheckIndex, etc., but it's another thing whose cost we'd have 
to benchmark and compare.
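For concreteness, a sketch of how a per-term minimum field length (what a per-term min norm would encode) could tighten a frequency-based bound; the formula and constants here are illustrative assumptions, not what Lucene stores:

```java
// Hypothetical sketch: a per-term minimum field length tightens the
// frequency-based bound, vs. assuming the shortest possible document.
public class NormBound {
    static final float K1 = 1.2f, B = 0.75f, AVG_LEN = 20f;

    // BM25 tf component with length normalization.
    static float bm25Tf(float freq, float len) {
        return (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * len / AVG_LEN));
    }

    // Without norm info, assume the most favorable length (len -> 0).
    static float looseBound(float maxFreq) {
        return bm25Tf(maxFreq, 0f);
    }

    // With a per-term minimum length, the bound can only shrink:
    // bm25Tf is decreasing in len, so minLen >= 0 gives a smaller value.
    static float tighterBound(float maxFreq, float minLen) {
        return bm25Tf(maxFreq, minLen);
    }

    public static void main(String[] args) {
        System.out.println(looseBound(5f));
        System.out.println(tighterBound(5f, 10f));
    }
}
```

Note this also shows the fragility: a single very short doc drives minLen toward zero and collapses the tighter bound back to the loose one.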

Allowing the similarity to store its own stuff seems impractical for a number 
of reasons: for example, we couldn't do any better for BM25 without breaking 
incremental and distributed search (avgFieldLength is unknown until runtime).

Another way to put it: with the current patch the Similarity has all the raw 
statistics it needs to compute its maxScore, except that norm and tf values 
are left unbounded, so I'd rather see those raw statistics stored if we want 
to improve the maxScore functions. As attractive as it sounds to combine them 
up-front for the best possible performance, I think it's not practical given 
the features we support.

> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, gsoc2014, patch, performance
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> which I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, Mike McCandless's package, resulting in 
> very significant speedups.
> This ticket is to get the discussion started on including the implementation 
> in Lucene's codebase. Because the technique requires awareness from the 
> Lucene user/developer, it seems best to make it a contrib/module package so 
> that it can be consciously chosen.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
