That makes sense. To be more precise, all I need is 100 of the 10,000 "reasonable" results.
The concern I would have with a TopDocCollector is that this is biased towards the top of the index, which translates for me into a bias for older documents. I'd prefer no age bias or a newer document bias. So I'll see what I can do with a "BottomDocCollector" :-)

Tim

On 12/4/08 12:39 PM, "Erick Erickson" <[EMAIL PROTECTED]> wrote:

> The problem here is how *could* a system return even the top
> 10,000 results without scoring them all? What if the millionth
> hit resulted in the very best match in the entire corpus?
>
> That said, sorting may well be the issue here rather than scoring.
> You can use a TopDocCollector to get the top N matches (unsorted)
> and then do something like use the FieldSortedHitQueue to sort
> those N matches, leaving out all the rest of the matches. Note
> this assumes that when you say "sorting" you mean sorting
> by something other than relevance.....
>
> Hope this helps
> Erick
>
> On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> I have an interesting problem with my query traffic. Most of the queries
>> run in a fairly short amount of time (< 100ms) but a few take over 1000ms.
>> These queries are predominantly those with a huge number of hits (>1 million
>> hits in a >100 million document index). The time taken (as far as I can tell)
>> is for lucene to sit there while it scores and sorts all these results.
>>
>> However it turns out these queries really don't have top results. That is,
>> of the million documents, there are easily 10000 which are decent results
>> (basically those above some threshold score). Frankly, just returning some
>> consistent (so paging and reload work) but otherwise arbitrary ranking of
>> these 10000 results would be more than good enough.
>>
>> It seems to me that a solution would be to impose some sort of pseudo-random
>> filter (e.g. consider only every n-th document assuming they are uniformly
>> distributed). I'm wondering if anyone else has experience with this sort of
>> issue and what solutions they have found to work well in practice.
>>
>> Thanks,
>>
>> Tim
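
For illustration only, here is a rough sketch of what the "BottomDocCollector" idea mentioned at the top of this message might look like. It is plain Java with no Lucene dependency: the class name, the collect(doc, score) callback shape, and the minScore threshold are all assumptions made up for the example, not Lucene API. It also assumes, as the message does, that higher document IDs correspond to newer documents. It keeps the N highest document IDs whose score passes the threshold, using a bounded min-heap so that no full sort of the hits is needed.

```java
import java.util.PriorityQueue;

/**
 * Hypothetical stand-in, not a Lucene class: keeps the maxDocs highest
 * document IDs among the hits whose score is at least minScore.
 */
class BottomDocCollector {
    private final int maxDocs;
    private final float minScore;
    // Min-heap on document ID: the head is the oldest document currently kept.
    private final PriorityQueue<Integer> heap = new PriorityQueue<>();

    BottomDocCollector(int maxDocs, float minScore) {
        if (maxDocs <= 0) {
            throw new IllegalArgumentException("maxDocs must be positive");
        }
        this.maxDocs = maxDocs;
        this.minScore = minScore;
    }

    /** Called once per matching document, in any order. */
    void collect(int doc, float score) {
        if (score < minScore) {
            return; // below the "reasonable result" threshold
        }
        if (heap.size() < maxDocs) {
            heap.add(doc);
        } else if (doc > heap.peek()) {
            heap.poll();   // evict the oldest kept document
            heap.add(doc); // keep the newer one instead
        }
    }

    /** Returns the kept document IDs, newest (highest ID) first. */
    int[] newestFirst() {
        PriorityQueue<Integer> copy = new PriorityQueue<>(heap);
        int[] docs = new int[copy.size()];
        // The copy yields IDs in ascending order, so fill from the back.
        for (int i = docs.length - 1; i >= 0; i--) {
            docs[i] = copy.poll();
        }
        return docs;
    }
}
```

Note that a collector like this is still called for every hit, so it only avoids the cost of sorting and of keeping more than N results; it does not by itself avoid scoring. Ordering by document ID is at least consistent across paging and reload for as long as the index is unchanged.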
