Doug Cutting wrote:
// Fetch the stored messageID for every hit -- one Hits.doc() call per result.
for (int i = 0; i < numResults; i++) {
    ids[i] = Long.parseLong(hits.doc(i).get("messageID"));
}

This is not a recommended way to use Lucene. The intent is that you should only have to call Hits.doc() for documents that you actually display, usually around 10 per query. Is this still a bottleneck when you fetch a max of 10 or 20 documents?

I didn't test this case.

So I'd be interested to hear why you need 1500 hits. My guess is that you're doing post-processing of hits, then selecting 10 or so to actually display. If you can figure out a way to do this post-processing without accessing the document object, i.e., through the query, a custom HitCollector, or the SearchBean, then this optimization is probably not needed.
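
For reference, a minimal sketch of the HitCollector route Doug mentions, assuming the Lucene 1.x API where Searcher.search(Query, HitCollector) invokes collect(int doc, float score) for every matching document (the raw List types match the Java of that era):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.HitCollector;

// Records only Lucene doc numbers as hits arrive; no Document is fetched,
// so Hits.doc() can be reserved for the handful of results actually shown.
public class DocNumberCollector extends HitCollector {
    public final List docNumbers = new ArrayList();

    public void collect(int doc, float score) {
        docNumbers.add(new Integer(doc));
    }
}

// Usage:
//   DocNumberCollector collector = new DocNumberCollector();
//   searcher.search(query, collector);
//   // post-process collector.docNumbers, then call searcher.doc(n)
//   // only for the few documents actually displayed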

We would dearly love not to have to post-process results returned from Lucene. Unfortunately, we can't foresee a way to avoid it given the current architecture of our applications and Lucene. The issue is that we must both exclude search results based upon an external (to Lucene) permission system and sort results upon criteria that likewise can't be stored inside Lucene (document rating is an example). Neither the permissions nor the external sort criteria can be stored in Lucene because they can impact too many documents when they change (a single permission change could require 'updating' a field in every document in the Lucene store) or because they change too often (it's quite probable that a document rating will change every time a document is viewed, for example).

The only way I foresee that we could internalize both of these factors into Lucene is if it were possible to modify a document inside Lucene at essentially no cost. Since that's not currently possible, we are stuck with retrieving all the documents from Lucene and post-processing them. Even if updating a document were possible, we might decide that it's just not worth it to store some document attributes in Lucene from an overall performance perspective. There may of course be other possible solutions; however, we haven't yet thought of them.
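
One possible mitigation is to shift the per-query cost of fetching every Document to index-open time instead. A rough sketch, assuming a long-lived IndexReader and that every document stores a messageID field (the external permission and rating lookups would then key off those IDs rather than off Lucene documents):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Reads the stored messageID field for every document once, when the
// reader is opened, so each query's post-processing can map Lucene doc
// numbers to external message IDs without any Hits.doc() calls.
public class MessageIdCache {
    private final long[] idByDoc;

    public MessageIdCache(IndexReader reader) throws IOException {
        idByDoc = new long[reader.maxDoc()];
        for (int i = 0; i < idByDoc.length; i++) {
            if (!reader.isDeleted(i)) {
                idByDoc[i] = Long.parseLong(reader.document(i).get("messageID"));
            }
        }
    }

    public long messageId(int doc) {
        return idByDoc[doc];
    }
}

The cache has to be rebuilt whenever the reader is reopened, so this only pays off if the index isn't refreshed on every query.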

A 30% optimization to a slow algorithm is better than nothing, but it would be better yet to improve the algorithm. That said, this sort of improvement is not always trivial, and lots of people use Lucene in the way that you have, so it may still be worth optimizing this.

30% on my machine - I think it's likely to be quite a bit faster when the Lucene files are striped across multiple disks. I can't test that assumption, though, as I don't have the hardware available. I believe the speedup is beneficial in almost all situations, and the cost associated with the optimization is quite minimal, especially when compared to the alternatives (slow searches under heavy load, or more memory usage and file descriptors through multiple readers).


Regards,

Bruce Ritchie
