[ https://issues.apache.org/jira/browse/LUCENE-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-3221:
--------------------------------
Fix Version/s: 5.0 (was: flexscoring branch)
> improve docvalues integration with scoring
> ------------------------------------------
>
> Key: LUCENE-3221
> URL: https://issues.apache.org/jira/browse/LUCENE-3221
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/index
> Reporter: Robert Muir
> Fix For: 5.0
>
>
> Currently, the flexscoring branch is limited by the fact that you can index
> at most a single byte per document for scoring within Similarity.
> I added a simple test, showing how in your app itself you can index a
> per-document value (such as a boost) and then use it in scoring:
> http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/TestDocValuesScoring.java
> However, I think we should generalize this mechanism (note, names of classes
> can be changed to whatever makes sense).
> In Similarity, instead of byte computeNorm(FieldInvertState), I think we
> should have void computeNorm(StatsWriter, FieldInvertState).
> Then a Similarity can ask the StatsWriter for instance(s), where an instance
> is something like a (name, type, aggregates) triple.
> Name would be a simple name like "boost" that the sim later uses to retrieve
> this docvalue. Type would be something like int8/int32/varint/byte.
> Aggregates could at first be a boolean or whatever; I think at first we
> should allow for the sum to be written (e.g. to provide sum and average).
> This would support aggregate statistics such as 'total number of tokens in
> index' and 'average length'.
> So an example of the new computeNorm (or whatever we call it) would be:
> {noformat}
> void computeNorm(StatsWriter writer, FieldInvertState state) {
>   writer.getReference("length", INT32, Aggregates.YES).write(state.numTokens);
>   writer.getReference("boost", FLOAT32, Aggregates.NO).write(state.boost);
>   ...
> }
> {noformat}
> I think the sim should be able to reference the docvalues fields it writes
> with "relative" names like length and boost.
> Whatever we do behind the scenes is an implementation detail.
> Also for this to work, I think we need to add int8, int16, int32, ... types
> to docvalues, and maybe we should add hasArray()/getArray(). I think
> the existing compressed INTS should be kept, but maybe renamed to varint or
> something like that. This could still be useful: for example, if someone
> wants to have "real document lengths" for BM25 but doesn't really need the
> full 32-bit range, they can make the tradeoff to use packed integers
> and load less into RAM. But that should be the sim's choice to make.
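To make the proposed shape concrete, here is a minimal in-memory sketch of
the idea in plain Java. It is not Lucene code: StatsWriter, Reference, Type
and Aggregates are hypothetical names taken from (or invented for) the
proposal above, FieldInvertState is a two-field stand-in that mirrors the
pseudocode rather than the real Lucene class, and values are held as doubles
in a list instead of being written to docvalues. It only illustrates how a
sim could write per-document values under relative names and read back a sum
and an average.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StatsWriterSketch {
    // Hypothetical value types and aggregate flag from the proposal.
    public enum Type { INT8, INT16, INT32, FLOAT32 }
    public enum Aggregates { YES, NO }

    /** One per-document value stream, identified by a relative name. */
    public static final class Reference {
        final String name;
        final Type type;
        final Aggregates aggregates;
        final List<Double> values = new ArrayList<>(); // one entry per document
        private double sum;

        Reference(String name, Type type, Aggregates aggregates) {
            this.name = name;
            this.type = type;
            this.aggregates = aggregates;
        }

        public void write(double value) {
            values.add(value);
            if (aggregates == Aggregates.YES) {
                sum += value; // only tracked when aggregates were requested
            }
        }

        public double sum() {
            if (aggregates == Aggregates.NO) {
                throw new IllegalStateException("no aggregates for " + name);
            }
            return sum;
        }

        public double average() {
            return values.isEmpty() ? 0.0 : sum() / values.size();
        }
    }

    /** Hands out one Reference per relative name. */
    public static final class StatsWriter {
        private final Map<String, Reference> refs = new LinkedHashMap<>();

        public Reference getReference(String name, Type type, Aggregates aggregates) {
            return refs.computeIfAbsent(name, n -> new Reference(n, type, aggregates));
        }
    }

    /** Stand-in for the inversion state; only the fields the pseudocode uses. */
    public static final class FieldInvertState {
        final int numTokens;
        final float boost;

        public FieldInvertState(int numTokens, float boost) {
            this.numTokens = numTokens;
            this.boost = boost;
        }
    }

    // The computeNorm from the proposal, against the sketch types.
    public static void computeNorm(StatsWriter writer, FieldInvertState state) {
        writer.getReference("length", Type.INT32, Aggregates.YES).write(state.numTokens);
        writer.getReference("boost", Type.FLOAT32, Aggregates.NO).write(state.boost);
    }

    public static void main(String[] args) {
        StatsWriter writer = new StatsWriter();
        computeNorm(writer, new FieldInvertState(3, 1.0f)); // document 0
        computeNorm(writer, new FieldInvertState(5, 2.0f)); // document 1

        Reference length = writer.getReference("length", Type.INT32, Aggregates.YES);
        System.out.println(length.sum());     // prints 8.0 (total tokens)
        System.out.println(length.average()); // prints 4.0 (average length)
    }
}
```

A real implementation would presumably map each relative name to a docvalues
field behind the scenes and encode values according to the declared type;
the sketch ignores both to keep the API shape visible.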