[ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213939#comment-16213939 ]
Robert Muir commented on LUCENE-4100:
-------------------------------------

{quote}
Store more metadata in order to be able to compute better higher bounds for the scores, like the maximum term frequency. Maybe even allow similarity-specific stats?
{quote}

Maximum term frequency may be a reasonable idea. It's imperfect but it's simple: it can be re-computed on merge, validated by CheckIndex, etc. The downside is that per-term stats are still fairly costly for blocktree, I think, so would the performance benefits be worth it? I do like that it'd require no additional storage for an {{omitTF}} field, since it is 1 there by definition.

For phrases it could still work too (just take the min or max across all the terms in the phrase and pass it to the sim as a bad approximation).

So instead of:
{code}
  /** Return the maximum score that this scorer may produce.
   * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not bounded. */
  public abstract float maxScore();
{code}

we'd have:
{code}
  /** Return the maximum score that this scorer may produce.
   * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not bounded.
   * @param maxFreq maximum possible frequency value */
  public abstract float maxScore(float maxFreq);
{code}

I wonder how much it would improve your performance for BM25, though? I think we shouldn't add such a stat unless it really helps our defaults, due to the costly nature of such statistics (not just cpu/storage, but api complexity, codec requirements, etc). On the side, by my math (just playing around with numbers), it's still not enough to make the optimization potent for ClassicSimilarity, which may be the worst case here :)

Theoretically it'd be simple to independently expose the min value of the field's norm to get even better bounds (the similarity could pull that itself from its LeafReader), but it'd be so fragile: messy data such as a single very short doc for the field will effectively negate that optimization for the entire field.
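
To make the maxFreq idea concrete, here is a minimal sketch rather than Lucene code. The class and method names are hypothetical. It assumes the textbook BM25 term score, idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * dl / avgdl)), and, because norms are left unbounded, it takes the length factor at its smallest possible value, k1 * (1 - b). The phrase helper assumes an exact phrase, whose frequency in a document cannot exceed the frequency of its rarest term.

```java
// Hypothetical illustration only: none of these names exist in Lucene.
final class MaxFreqBound {

    /**
     * Upper bound on a BM25 term score when only the maximum term
     * frequency is known and the document length (norm) is unbounded.
     */
    static float bm25MaxScore(float idf, float k1, float b, float maxFreq) {
        if (maxFreq <= 0f) {
            return 0f; // the term cannot match, so it cannot contribute
        }
        if (Float.isInfinite(maxFreq)) {
            return idf * (k1 + 1f); // limit of the tf part as freq grows
        }
        // Smallest possible length factor: document length tends to 0.
        float minLengthFactor = k1 * (1f - b);
        return idf * maxFreq * (k1 + 1f) / (maxFreq + minLengthFactor);
    }

    /** Crude bound for an exact phrase: the min of the terms' max freqs. */
    static int phraseMaxFreq(int... termMaxFreqs) {
        int min = Integer.MAX_VALUE;
        for (int f : termMaxFreqs) {
            min = Math.min(min, f);
        }
        return min;
    }

    public static void main(String[] args) {
        float unbounded = bm25MaxScore(1f, 1.2f, 0.75f, Float.POSITIVE_INFINITY);
        float omitTf = bm25MaxScore(1f, 1.2f, 0.75f, 1f);
        System.out.println("unbounded freq: " + unbounded);
        System.out.println("maxFreq = 1:    " + omitTf);
        System.out.println("phrase maxFreq: " + phraseMaxFreq(7, 3, 12));
    }
}
```

With idf = 1, k1 = 1.2 and b = 0.75, the bound drops from 2.2 for an unbounded frequency to roughly 1.69 for maxFreq = 1, which is the {{omitTF}} case. Whether a gap of that size lets enough documents be skipped is exactly what would need benchmarking.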

Alternatively, storing the min norm value per term might do it: once written it does not change, so it can be easily merged, cross-verified against the norm values in CheckIndex, etc. But it's another thing to benchmark and compare the cost of.

Allowing the similarity to store its own stuff seems impractical for a number of reasons: for example, we couldn't get any better for BM25 without breaking incremental and distributed search (avgFieldLength is unknown until runtime).

Another way to put it: with the current patch the Similarity has all the raw statistics it needs to compute its maxScore, except that norm and tf values are left unbounded, so I'd rather see those raw statistics stored if we want to improve their maxScore functions. As attractive as it sounds to combine them up-front for the best possible performance, I think it's not practical given the features we support.

> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Stefan Pohl
>              Labels: api-change, gsoc2014, patch, performance
>             Fix For: 4.9, 6.0
>
>         Attachments: LUCENE-4100.patch, LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood,
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with
> example queries and lucenebench, the package of Mike McCandless, resulting in
> very significant speedups.
> This ticket is to start the discussion on including the implementation
> into Lucene's codebase.
> Because the technique requires awareness about it
> from the Lucene user/developer, it seems best to become a contrib/module
> package so that it consciously can be chosen to be used.