I will not be able to do any Lucene work until May 5th and I will not have much time for emails either. I have to do a contract programming job for one of my customers and unfortunately it´s not Lucene-related :-(
Christoph
Doug Cutting wrote:
Christoph,
Thanks for finding fixing these. This reminds me of a couple of scorer-related optimizations that we should keep in mind:
1. As you've noticed, the logic of ConjunctionScorer and PhraseScorer (and SpansNear, for that matter) are very similar, except ConjunctionScorer uses java's collection classes, while PhraseScorer uses Lucene's PriorityQueue and a home-grown linked list. Because of this, PhraseScorer allocates no new objects while scoring, while ConjuctionScorer must allocate a number of objects each time skipTo() is called. ConjunctionScorer is still usually much faster than BooleanScorer, since it uses SkipTo, but it could be much faster yet. So someday we should probably re-write ConjunctionScorer to use a PriorityQueue, so that it allocates fewer objects while scoring.
Sounds reasonable and shouldn´t bee too difficult.
2. BooleanScorer does not return hits in strict document-id order (arguably a bug). Because of this, ConjunctionScorer cannot be used when any of its sub-scorers use BooleanScorer, i.e., with sub-queries that contain optional or prohibited clauses. This is unfortunate, as there are probably lots of cases where a conjunction of disjunctions could be accellerated by skipTo() that we are not accellerating. One way to fix this is to re-write BooleanScorer so that it always returns hits in order, using an algorithm like that in SpanOrQuery.getSpans(). However, that would make large disjunctions somewhat slower, by a factor of log(n), where n is the number of clauses in the disjunction. I don't know whether this would really be significant. Another approach might be to write a version of BooleanScorer that returns hits in order, and only to use it when it is embedded in a conjunction. This is trickier to implement, and it would be best if BooleanScorer always returned hits in the correct order, so I lean slightly towards the first approach.
Instead of holding active buckets in a linked list, we could maintain a priority queue ordered by document-id. A simple change as far as I see. This would mean that next() would at least return documents in the correct order and BooleanScorer could implement skipTo (of course not optimized). However, this would off course imply the overhead of keeping the heap of the priority queue in order, compared to the simple list insert. I think the overhead is acceptable.
What about optimizing boolean queries that contain required subqueries? Perhaps it would make sense to treat required subqueries different from the other subqueries in a future boolean scorer. However, I don´t have very concrete ideas yet.
PHRASE AND SPAN Queries ************************** As far as I see, PhraseQueries and PhrasePrefixQueries are completely subsumed by the new SpanQueries. Maybe PhraseQueries and PhrasePrefixQueries could be eliminated at least from version 2.0?
I would like to see the stopWords/positionIncrement problem solved for 1.4. You remember that Erik introduced a position increment for stop words, and then the PhraseQueries didn´t work as expected. Currently, phrases match across stop words, since positions of stop words are ignored. I think PhraseQuery, PhrasePrefixQuery, and SpanNearQuery could be changed (give relative positions for each part in the constructor), however I am not completely sure whether my assumption is correct.
I studied the Span-code today. Thought it would be a good time for doing it since I had already loaded the query stuff into my head from yesterday´s debugging :-)
I have some questions/remarks that might be worth considering:
*) SpanScorer: public boolean next() throws IOException { if (firstTime) { more = spans.next(); firstTime = false; }
if (!more) return false;
freq = 0.0f; doc = spans.doc();
while (more && doc == spans.doc()) { int matchLength = spans.end() - spans.start(); freq += getSimilarity().sloppyFreq(matchLength); more = spans.next(); }
return more || freq != 0.0f; }
If freq == 0 shouldn´t we switch to the next document within next()? This is the way PhraseScorer handles this case.
Furthermore, shouldn´t the while-loop in skipTo be identical to the one in next()? However, instead of doc, it compares target with spans.doc()!
*) SpanTermQuery: Shouldn´t start and end of its Spans always return the same value, so that in the special case of a single SpanTermQuery freq becomes the usual value of 1 with DefaultSimilarity?
*) SpanQuery: Collection getTerms()
Wouldn´t it be better to use Set internally in ordet to avoid dublicates, or do we need dublicates?
*) SpanOrQuery: all no longer needed when everything is in queue. However is kept up to date.
*) SpanNearQuery.getSpans(...): I don´t understand the case clauses.size == 0.
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]