On Tuesday 15 November 2005 23:45, Yonik Seeley wrote:
> Totally untested, but here is a hack at what the scorer might look
> like when the number of terms is large.
>
> -Yonik
>
>
> package org.apache.lucene.search;
>
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.TermDocs;
>
> import java.io.IOException;
>
> /**
> * @author yonik
> * @version $Id$
> */
> public class MultiTermScorer extends Scorer{
> protected final float[] scores;
> protected int pos;
> protected float docScore;
>
> public MultiTermScorer(Similarity similarity, IndexReader reader,
> Weight w, TermEnum terms, byte[] norms, boolean include_idf, boolean
> include_tf) throws IOException {
> super(similarity);
> float weightVal = w.getValue();
> int maxDoc = reader.maxDoc();
> this.scores = new float[maxDoc];
> float[] normDecoder = Similarity.getNormDecoder();
>
> TermDocs tdocs = reader.termDocs();
This part is only needed at the top level of the query, so
one could implement it in this optimization hook of BooleanScorer:
/** Expert: Collects matching documents in a range.
* <br>Note that {@link #next()} must be called once before this method is
* called for the first time.
* @param hc The collector to which all matching documents are passed through
* {@link HitCollector#collect(int, float)}.
* @param max Do not score documents past this.
* @return true if more matching documents may remain.
*/
protected boolean score(HitCollector hc, int max) throws IOException {
...
}
> while (terms.next()) {
> tdocs.seek(terms);
That should be terms.term() here, iirc.
> float termScore = weightVal;
> if (include_idf) {
> termScore *= similarity.idf(terms.docFreq(),maxDoc);
> }
> while (tdocs.next()) {
> int doc = tdocs.doc();
> float subscore = termScore;
> if (include_tf) subscore *= tdocs.freq();
This should use getSimilarity().tf(tdocs.freq()) rather than the raw frequency.
> if (norms!=null) subscore *= normDecoder[norms[doc&0xff]];
> scores[doc] += subscore;
The scores[] array is the pain point, but where it can be used,
this approach generalizes to DisjunctionSumScorer, so it would
work for all disjunctions, not only for terms.
I think this hook can be implemented for DisjunctionSumScorer
with a scores[] array by iterating over the subscorers one by one.
Getting that hook called through BooleanScorer2 is no problem
when the coordination factor can be left out.
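
A minimal sketch of that idea, to make it concrete. It does not use the real
Lucene classes: SubScorer, Collector, ArraySubScorer and the explicit "from"
argument are hypothetical stand-ins for Scorer, HitCollector and the scorer's
current position. It also assumes that a zero entry in scores[] means "no
match", which would drop matching documents whose score is exactly zero, and
it leaves out the coordination factor.

```java
import java.util.List;

/** Sketch only: the types here are hypothetical stand-ins, not Lucene classes. */
class DisjunctionArraySketch {

    /** Stand-in for a Scorer: iterates matching docs in increasing order. */
    interface SubScorer {
        boolean next();   // advance to the next matching doc; false when exhausted
        int doc();        // current doc number
        float score();    // score of the current doc
    }

    /** Stand-in for HitCollector. */
    interface Collector {
        void collect(int doc, float score);
    }

    /** Simple in-memory SubScorer over parallel arrays, for illustration. */
    static final class ArraySubScorer implements SubScorer {
        private final int[] docs;
        private final float[] scores;
        private int pos = -1;

        ArraySubScorer(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        public boolean next() { return ++pos < docs.length; }
        public int doc() { return docs[pos]; }
        public float score() { return scores[pos]; }
    }

    /** Exhausts the subscorers one by one, summing scores into an array indexed by doc. */
    static float[] accumulate(List<? extends SubScorer> subScorers, int maxDoc) {
        float[] scores = new float[maxDoc];
        for (SubScorer s : subScorers) {
            while (s.next()) {
                scores[s.doc()] += s.score();
            }
        }
        return scores;
    }

    /**
     * Passes docs in [from, max) with a non-zero score to the collector.
     * Returns true if doc numbers at or beyond max remain to be checked.
     */
    static boolean collectRange(float[] scores, int from, int max, Collector hc) {
        int end = Math.min(max, scores.length);
        for (int doc = from; doc < end; doc++) {
            if (scores[doc] != 0.0f) {
                hc.collect(doc, scores[doc]);
            }
        }
        return end < scores.length;
    }
}
```

The cost is one float per document in the index, regardless of how many
documents match, which is why the array remains the pain point.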
Regards,
Paul Elschot