[jira] [Updated] (LUCENE-3221) improve docvalues integration with scoring

Robert Muir (JIRA) Tue, 01 Oct 2013 11:41:28 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robert Muir updated LUCENE-3221:
--------------------------------

    Fix Version/s:     (was: flexscoring branch)
                   5.0

> improve docvalues integration with scoring
> ------------------------------------------
>
>                 Key: LUCENE-3221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3221
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/index
>            Reporter: Robert Muir
>             Fix For: 5.0
>
>
> Currently, the flexscoring branch is limited by the fact that you can at most 
> index one single byte per-document for scoring within Similarity.
> I added a simple test, showing how in your app itself you can index a 
> per-document value (such as a boost) and then use it in scoring: 
> http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/TestDocValuesScoring.java
> However, I think we should generalize this mechanism (note, names of classes 
> can be changed to whatever makes sense).
> In Similarity, instead of byte computeNorm(FieldInvertState), I think we 
> should have void computeNorm(StatsWriter, FieldInvertState).
> Then a Similarity can ask the StatsWriter for instance(s), where an instance 
> is something like a (name, type, aggregates) pair.
> Name would be a simple name like "boost" that the sim later uses to retrieve 
> this docvalue. Type would be something like int8/int32/varint/byte.
> Aggregates could at first be a boolean or whatever; I think at first we 
> should allow for the sum to be written (e.g. to provide sum and average).
> This would support aggregate statistics such as 'total number of tokens in 
> index' and 'average length'.
> so an example of the new computeNorm or whatever we call it would be
> {noformat}
>   void computeNorm(StatsWriter writer, FieldInvertState state) {
>     writer.getReference("length", INT32, Aggregates.YES).write(state.numTokens);
>     writer.getReference("boost", FLOAT32, Aggregates.NO).write(state.boost);
>     ...
>   }
> {noformat}
> So these docvalues field names that the Sim writes, I think the sim should be 
> able to reference them with "relative" names like length and boost.
> Whatever we do behind the scenes is an implementation detail.
> Also for this to work, I think we need to add int8, int16, int32, ... types 
> to docvalues, and maybe we should add hasArray()/getArray(). I think
> the existing compressed INTS should be kept, but maybe renamed to varint or 
> something like that. This could still be useful, for example if someone
> wants to have "real document lengths" for bm25, but they don't really need a 
> full 32-bit range, they can make the tradeoff to use packed integers
> and load less into ram... but that should be the sim's choice to make.
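
To make the proposal above concrete, here is a minimal, self-contained Java sketch of how a StatsWriter with (name, type, aggregates) references might behave. Everything in it is an assumption for illustration: StatType, Aggregates, StatReference, the in-memory StatsWriter, and FieldState (a stand-in for FieldInvertState) are hypothetical names, not existing Lucene classes. Values are kept in memory rather than written as docvalues, and the declared type is only recorded, not enforced.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types sketching the proposal; none of these exist in Lucene.
enum StatType { INT8, INT16, INT32, VARINT, FLOAT32 }

enum Aggregates { YES, NO }

/** One named per-document statistic that can optionally track a running sum. */
final class StatReference {
  final String name;
  final StatType type;          // recorded only; this sketch does not enforce it
  final boolean aggregate;
  private final List<Double> values = new ArrayList<>(); // one value per document
  private double sum;           // maintained only when aggregate is true

  StatReference(String name, StatType type, Aggregates aggregates) {
    this.name = name;
    this.type = type;
    this.aggregate = aggregates == Aggregates.YES;
  }

  /** Records the value for the next document, in document order. */
  void write(double value) {
    values.add(value);
    if (aggregate) {
      sum += value;
    }
  }

  /** Returns the value written for the given document. */
  double get(int docId) {
    return values.get(docId);
  }

  double sum() {
    if (!aggregate) {
      throw new IllegalStateException("no aggregates requested for " + name);
    }
    return sum;
  }

  double average() {
    return values.isEmpty() ? 0.0 : sum() / values.size();
  }
}

/** Hands out references by their "relative" name, creating them on first use. */
final class StatsWriter {
  private final Map<String, StatReference> refs = new LinkedHashMap<>();

  StatReference getReference(String name, StatType type, Aggregates aggregates) {
    return refs.computeIfAbsent(name, n -> new StatReference(n, type, aggregates));
  }
}

/** Minimal stand-in for FieldInvertState with only the two fields used here. */
final class FieldState {
  final int numTokens;
  final float boost;

  FieldState(int numTokens, float boost) {
    this.numTokens = numTokens;
    this.boost = boost;
  }
}

/** A similarity that writes two per-document statistics, as in the example above. */
final class SketchSimilarity {
  void computeNorm(StatsWriter writer, FieldState state) {
    writer.getReference("length", StatType.INT32, Aggregates.YES).write(state.numTokens);
    writer.getReference("boost", StatType.FLOAT32, Aggregates.NO).write(state.boost);
  }
}
```

With this sketch, calling computeNorm once per document and then asking the "length" reference for sum() and average() would give the 'total number of tokens' and 'average length' statistics mentioned above, while "boost" stays a plain per-document value with no aggregate.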



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
