Hi Eugen,

As far as I can tell (and if I'm right), the page calculation is done
in Lucene. Since it is quite "expensive" (time-consuming) to dedup _all_
results when you only need the first page, I guess this is simply not
done at the moment. However, since I also needed the exact number, I at
least found the "dirty hack" quoted below. That helps for the moment.
But as it might take quite a while to find out the exact number of pages,
I suggest that you cache it, e.g. under a "hash" of the words searched for
and, to be safe, the number of non-dedupped search results, so you don't
have to compute the exact number again and again when moving between pages.
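
A rough sketch of that caching idea, in Python only for brevity. This is
not Nutch code: the class ExactCountCache and the callable
probe_last_index are made-up names. probe_last_index is assumed to wrap
the "one hit per page, start-index 99999" search and to return the
start-index of the last item.

```python
import hashlib
from typing import Callable, Dict


class ExactCountCache:
    """Caches the exact (dedupped) hit count per query.

    Hypothetical helper, not part of Nutch. `probe_last_index` is an
    assumed callable that runs the expensive last-page search and
    returns the start-index of the last item.
    """

    def __init__(self, probe_last_index: Callable[[str], int]) -> None:
        self._probe = probe_last_index
        self._counts: Dict[str, int] = {}

    @staticmethod
    def make_key(query: str, raw_total: int) -> str:
        # Normalize case and whitespace so trivially different spellings
        # of the same query share one cache entry.
        normalized = " ".join(query.lower().split())
        # Including the non-dedupped total makes the key change when the
        # index (and hence the raw hit count) changes.
        payload = f"{normalized}|{raw_total}".encode("utf-8")
        return hashlib.sha1(payload).hexdigest()

    def exact_count(self, query: str, raw_total: int) -> int:
        key = self.make_key(query, raw_total)
        if key not in self._counts:
            # Start-index of the last item, plus one, is the exact count.
            self._counts[key] = self._probe(query) + 1
        return self._counts[key]
```

The expensive probe then runs once per distinct query and raw total;
paging back and forth reuses the stored number. A query with no hits at
all would need separate handling, which the sketch leaves out.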


Hope that helps,
 Stefan

Eugen Kochuev wrote:
> 
> And did you manage to locate the place where the filtering on per
> site basis is done? Is it possible to tweak nutch to make it telling
> the exact number of pages after filtering or is there a problem?
> 
>> I've got a pending nutch-issue on this
>> http://issues.apache.org/jira/browse/NUTCH-288
> 
>> A dirty workaround (though working) is to do a search with one hit per
>> page and start-index as 99999. That will give you the actual start-index
>> of the last item, which +1 is the number of results you are looking for.
>> Since requesting the last page takes a bit resources, you might want to
>> cache that result actually - so users searching again or navigating
>> through pages get the number of pages faster.
> 
>> PS: For the OpenSearch-connector to not throw an exception but to return
>> the last page, please apply the patch I attached to the bug.
