Do you have a dataset and queries I can test on?

On Dec 10, 2007 1:16 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Shai Erera wrote:
>
> > No - I didn't try to populate an index with real data and run real
> > queries (what is "real" after all?). I know from my experience of
> > indexes with several millions of documents where there are queries
> > with several hundred thousands results (one query even hit 2.5 M
> > documents). This is typical in search: users type on average 2.3
> > terms in a query. The chances you'd hit a query with huge result set
> > are not that small in such cases (I'm not saying this is the most
> > common case though, I agree that most of the searches don't process
> > that many documents).
>
> Agreed: many queries do hit a great many results. But I agree with
> Paul: it's not clear how this "typically" translates into how many
> ScoreDocs get created?
>
> > However, this change will improve performance from the algorithm
> > point of view - you allocate as many as numRequestedHits+1 no matter
> > how many documents your query processes.
>
> It's definitely a good step forward: not creating extra garbage in hot
> spots is worthwhile, so I think we should make this change. Still I'm
> wondering how much this helps in practice.
>
> I think benchmarking on "real" use cases (vs synthetic tests) is
> worthwhile: it keeps you focused on what really counts, in the end.
>
> In this particular case there are at least 2 things it could show us:
>
>   * How many ScoreDocs really get created, or, what %tg of hits
>     actually result in an insertion into the PQ?
>
>   * How much is this savings as a %tg of the overall time spent
>     searching?
>
> Mike

--
Regards,
Shai Erera
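
For reference, here is a minimal sketch of the idea being discussed: a top-N collector that recycles the entry evicted from the priority queue, so it allocates at most numRequestedHits+1 objects however many documents are collected. It also counts insertions, which is the first number Mike asks about. This is not Lucene code: TopNSketch, Hit, collect and the counters are hypothetical names, and it uses java.util.PriorityQueue rather than Lucene's own queue.

```java
import java.util.PriorityQueue;

// Hypothetical illustration only; not the Lucene collector.
final class TopNSketch {
    // Stand-in for a ScoreDoc: mutable so an evicted entry can be reused.
    static final class Hit {
        int doc;
        float score;
    }

    private final int n;
    // Min-heap on score: the weakest of the current top N sits at the head.
    private final PriorityQueue<Hit> queue;
    private Hit spare;   // last evicted entry, waiting to be reused
    int allocations;     // how many Hit objects were created
    int insertions;      // how many hits actually entered the queue

    TopNSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        this.n = n;
        this.queue = new PriorityQueue<>(n + 1,
                (a, b) -> Float.compare(a.score, b.score));
    }

    void collect(int doc, float score) {
        // Once the queue is full, a hit that does not beat the weakest
        // entry is dropped without touching the heap or allocating.
        if (queue.size() == n && score <= queue.peek().score) {
            return;
        }
        Hit h = spare;
        if (h == null) {
            h = new Hit();
            allocations++;
        }
        spare = null;
        h.doc = doc;
        h.score = score;
        queue.add(h);
        insertions++;
        if (queue.size() > n) {
            spare = queue.poll(); // keep the evicted entry for the next hit
        }
    }

    int size() {
        return queue.size();
    }

    float minScore() {
        return queue.peek().score;
    }
}
```

Feeding it hits in ascending score order is the worst case for insertions (every hit enters the queue), yet allocations stay at n+1. On a real index the ratio of insertions to hits collected is what a benchmark would need to report.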
