Yonik Seeley wrote:
Setup info & Stats:
- 4.3M documents, 12 keyword fields per document, 11
[ ... ]
"field1:4 AND field2:188453 AND field3:1"

field1:4      done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1      done alone selects around 1K records
The whole query normally selects fewer than 50 records.
Only the first 10 are returned (or whatever range
the client selects).

The "field1:4" clause is probably dominating the cost of query execution. Clauses which match large portions of the collection are slow to evaluate. If there are not too many different such clauses then you can optimize this by re-using a Filter in place of such clauses, typically a QueryFilter.


For example, Nutch automatically translates such clauses into QueryFilters. See:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup

Note that this only converts clauses whose boost is zero: since filters do not affect ranking, we can only safely convert clauses which do not contribute to the score. Scores might still differ in the filtered results because of Similarity.coord(), but in Nutch Similarity.coord() is overridden to always return 1.0, so that replacing clauses with filters does not alter the final scores at all.
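
A minimal sketch of that kind of coord() override (assuming the Lucene 1.4-era DefaultSimilarity; the class name here is made up, and Nutch's actual override lives in its own Similarity subclass):

  import org.apache.lucene.search.DefaultSimilarity;

  // Make the coordination factor constant, so that moving a clause
  // out of the BooleanQuery and into a Filter cannot change scores.
  public class NoCoordSimilarity extends DefaultSimilarity {
    public float coord(int overlap, int maxOverlap) {
      return 1.0f;
    }
  }

  // Install it on the searcher before running queries:
  // searcher.setSimilarity(new NoCoordSimilarity());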

Doug
