Re: Stop search process when a given number of hits is reached

Andrzej Bialecki Thu, 07 Aug 2008 05:30:15 -0700

Doron Cohen wrote:

Nothing built in that I'm aware of will do this, but it can be done by
searching with your own HitCollector.
There is a related feature - stop search after a specified time - using
TimeLimitedCollector.
It is not released yet, see issue LUCENE-997.
In short, the collector's collect() method is invoked in the search process
for each matching document.
Once 500 docs were collected, your collector can cause the search to stop by
throwing an exception.
Upon catching the exception you know that 500 docs were collected.
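
A minimal sketch of the collector approach described above. To keep it
runnable without Lucene on the classpath, HitCollector is declared here as a
stand-in with the same collect(int doc, float score) shape as the Lucene 2.x
class. HitLimitReachedException, LimitedHitCollector, EarlyStopDemo and
fakeSearch() are hypothetical names invented for this illustration; with a
real index the loop in fakeSearch() would be replaced by a call such as
searcher.search(query, collector).

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.lucene.search.HitCollector (Lucene 2.x),
// declared locally only so that this sketch compiles on its own.
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

// Hypothetical stop signal. It is unchecked because collect()
// declares no checked exceptions.
class HitLimitReachedException extends RuntimeException {
    HitLimitReachedException(int limit) {
        super("collected " + limit + " hits");
    }
}

// Records matching doc ids and aborts the search once maxHits were seen.
class LimitedHitCollector extends HitCollector {
    private final int maxHits;
    private final List<Integer> docs = new ArrayList<>();

    LimitedHitCollector(int maxHits) {
        this.maxHits = maxHits;
    }

    @Override
    public void collect(int doc, float score) {
        docs.add(doc);
        if (docs.size() >= maxHits) {
            throw new HitLimitReachedException(maxHits);
        }
    }

    List<Integer> getDocs() {
        return docs;
    }
}

class EarlyStopDemo {
    // Simulated search loop: invokes collect() once per matching document.
    static void fakeSearch(int numMatches, HitCollector collector) {
        for (int doc = 0; doc < numMatches; doc++) {
            collector.collect(doc, 1.0f);
        }
    }

    // Returns true if the search was cut short by the hit limit.
    static boolean searchWithLimit(int numMatches, LimitedHitCollector collector) {
        try {
            fakeSearch(numMatches, collector);
            return false; // every match was visited
        } catch (HitLimitReachedException e) {
            return true; // the limit was reached; collector holds the hits so far
        }
    }
}
```

After searchWithLimit() returns, the collector's getDocs() holds whatever was
gathered, whether or not the search ran to completion.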


Two additional comments:

* the top-N results from such an incomplete search may be way off if there
  were some high-scoring documents somewhere beyond the limit.

* if you know that there are more important and less important documents
  in your corpus, and their relative weight is independent of the query
  (e.g. a PageRank-type score), then you can restructure your index so that
  postings belonging to high-scoring documents come first on the posting
  lists. This way you have a better chance of collecting highly relevant
  documents first, even though the search is incomplete. You can find an
  implementation of this concept in Nutch
  (org.apache.nutch.indexer.IndexSorter).
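
To illustrate both comments, here is a toy sketch using plain arrays rather
than a real index. It is not how Nutch's IndexSorter is implemented: it simply
reorders one posting list by a query-independent score and compares what an
early-terminated search would see before and after. StaticRankOrderDemo,
firstHits() and sortByStaticScore() are hypothetical names, and the scores in
the usage note are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StaticRankOrderDemo {
    // Returns the first `limit` postings, i.e. everything a search
    // that stops after `limit` hits gets to see.
    static List<Integer> firstHits(int[] postings, int limit) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < postings.length && i < limit; i++) {
            hits.add(postings[i]);
        }
        return hits;
    }

    // Returns a copy of the posting list ordered by descending static
    // score, where staticScore[doc] is the query-independent weight.
    static int[] sortByStaticScore(int[] postings, float[] staticScore) {
        Integer[] boxed = new Integer[postings.length];
        for (int i = 0; i < postings.length; i++) {
            boxed[i] = postings[i];
        }
        Arrays.sort(boxed, (a, b) -> Float.compare(staticScore[b], staticScore[a]));
        int[] sorted = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            sorted[i] = boxed[i];
        }
        return sorted;
    }
}
```

With postings {0, 1, 2, 3, 4, 5} and static scores {0.1, 0.2, 0.9, 0.3, 0.8,
0.4}, stopping after two hits on the original order yields docs 0 and 1, the
two lowest-scoring ones. After sorting, the same limit yields docs 2 and 4,
the two highest-scoring ones.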


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


