Re: scorer optimizations

Christoph Goller Thu, 22 Apr 2004 07:08:36 -0700

Doug,

I will not be able to do any Lucene work until May 5th and I
will not have much time for emails either. I have to do a
contract programming job for one of my customers and
unfortunately it´s not Lucene-related :-(

Christoph

Doug Cutting wrote:

Christoph,

Thanks for finding fixing these. This reminds me of a couple of scorer-related optimizations that we should keep in mind:

1. As you've noticed, the logic of ConjunctionScorer and PhraseScorer (and SpansNear, for that matter) are very similar, except ConjunctionScorer uses java's collection classes, while PhraseScorer uses Lucene's PriorityQueue and a home-grown linked list. Because of this, PhraseScorer allocates no new objects while scoring, while ConjuctionScorer must allocate a number of objects each time skipTo() is called. ConjunctionScorer is still usually much faster than BooleanScorer, since it uses SkipTo, but it could be much faster yet. So someday we should probably re-write ConjunctionScorer to use a PriorityQueue, so that it allocates fewer objects while scoring.

Sounds reasonable and shouldn´t bee too difficult.

2. BooleanScorer does not return hits in strict document-id order (arguably a bug). Because of this, ConjunctionScorer cannot be used when any of its sub-scorers use BooleanScorer, i.e., with sub-queries that contain optional or prohibited clauses. This is unfortunate, as there are probably lots of cases where a conjunction of disjunctions could be accellerated by skipTo() that we are not accellerating. One way to fix this is to re-write BooleanScorer so that it always returns hits in order, using an algorithm like that in SpanOrQuery.getSpans(). However, that would make large disjunctions somewhat slower, by a factor of log(n), where n is the number of clauses in the disjunction. I don't know whether this would really be significant. Another approach might be to write a version of BooleanScorer that returns hits in order, and only to use it when it is embedded in a conjunction. This is trickier to implement, and it would be best if BooleanScorer always returned hits in the correct order, so I lean slightly towards the first approach.


Instead of holding active buckets in a linked list, we could maintain a
priority queue ordered by document-id.  A simple change as far as I see.
This would mean that next() would  at least return documents in  the
correct order and BooleanScorer  could implement skipTo (of course not
optimized). However, this would off course imply the overhead of keeping
the heap of the priority queue in order, compared to the simple list insert.
I think the overhead is acceptable.

What about optimizing boolean queries that contain required subqueries?
Perhaps it would make sense to treat required subqueries different from
the other subqueries in a future boolean scorer. However, I don´t have
very concrete ideas yet.

PHRASE AND SPAN Queries
**************************
As far as I see, PhraseQueries and PhrasePrefixQueries are completely
subsumed by the new SpanQueries. Maybe PhraseQueries and
PhrasePrefixQueries could be eliminated at least from version 2.0?

I would like to see the stopWords/positionIncrement problem solved
for 1.4. You remember that Erik introduced a position increment
for stop words, and then the PhraseQueries didn´t work as expected.
Currently, phrases match across stop words, since positions of stop words
are ignored. I think PhraseQuery, PhrasePrefixQuery, and SpanNearQuery
could be changed (give relative positions for each part in the constructor),
however I am not completely sure whether my assumption is correct.

I studied the Span-code today. Thought it would be a good time for doing it
since I had  already loaded the query stuff into my head from yesterday´s
debugging :-)

I have some questions/remarks that might be worth considering:

*) SpanScorer:
  public boolean next() throws IOException {
    if (firstTime) {
      more = spans.next();
      firstTime = false;
    }

if (!more) return false;

    freq = 0.0f;
    doc = spans.doc();

    while (more && doc == spans.doc()) {
      int matchLength = spans.end() - spans.start();
      freq += getSimilarity().sloppyFreq(matchLength);
      more = spans.next();
    }

    return more || freq != 0.0f;
  }

If freq == 0 shouldn´t we switch to the next document within next()?
This is the way PhraseScorer handles this case.

Furthermore, shouldn´t the while-loop in skipTo be identical to the one
in next()? However, instead of doc, it compares target with spans.doc()!

*) SpanTermQuery: Shouldn´t start and end of its Spans always return the
same value, so that in the special case of a single SpanTermQuery freq
becomes the usual value of 1 with DefaultSimilarity?

*) SpanQuery: Collection getTerms() Wouldn´t it be better to use Set internally in ordet to avoid dublicates, or do we need dublicates?

*) SpanOrQuery: all no longer needed when everything is in queue. However
is kept up to date.

*) SpanNearQuery.getSpans(...): I don´t understand the case clauses.size == 0.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: scorer optimizations

Reply via email to