Hi Eugen,

as far as I can tell (if I'm right), the page calculation is done in Lucene. Since it is quite expensive (time-consuming) to dedup _all_ results when you only need the first page, I guess this is simply not done at the moment. However, since I also needed the exact number, I at least found the "dirty hack" described below. That helps for now. But since finding the exact number of pages can take quite a while, I suggest that you compose a hash of the words searched for (and, to be sure, the number of non-deduplicated search results) and cache the exact number under that key, so you don't have to compute it again and again when moving between pages.
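
To make the caching idea concrete, here is a minimal sketch in Python (not Nutch's Java, and not the actual Nutch code). It assumes a hypothetical search(query, start, hits_per_page) function that returns the actual start index of the page it served together with the hits on it, which is how the workaround quoted below behaves once the patch is applied. All names (exact_total, cache_key, _count_cache) are made up for illustration.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Hypothetical search signature (an assumption, not a Nutch API):
# (query, start_index, hits_per_page) -> (actual_start_index, hits)
SearchFn = Callable[[str, int, int], Tuple[int, List[str]]]

# In-memory cache: key -> exact number of deduplicated results.
_count_cache: Dict[str, int] = {}


def cache_key(query: str, raw_total: int) -> str:
    """Hash the searched words plus the non-deduplicated hit count."""
    words = " ".join(query.split())  # normalise whitespace only
    return hashlib.sha1(f"{words}|{raw_total}".encode("utf-8")).hexdigest()


def exact_total(search: SearchFn, query: str, raw_total: int) -> int:
    """Return the exact result count, asking for the last page only once."""
    key = cache_key(query, raw_total)
    if key not in _count_cache:
        # The "dirty hack": one hit per page, start index far past the end.
        # The search is assumed to fall back to the last real item.
        actual_start, hits = search(query, 99999, 1)
        _count_cache[key] = actual_start + 1 if hits else 0
    return _count_cache[key]
```

Including the non-deduplicated count in the key means the cached number is dropped automatically when the index changes enough to alter the raw hit count. A real setup would probably also want an expiry time on the entries.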
Hope that helps,
Stefan

Eugen Kochuev wrote:
>
> And did you manage to locate the place where the filtering on per
> site basis is done? Is it possible to tweak nutch to make it telling
> the exact number of pages after filtering or is there a problem?
>
>> I've got a pending nutch-issue on this
>> http://issues.apache.org/jira/browse/NUTCH-288
>
>> A dirty workaround (though working) is to do a search with one hit per
>> page and start-index as 99999. That will give you the actual start-index
>> of the last item, which +1 is the number of results you are looking for.
>> Since requesting the last page takes a bit resources, you might want to
>> cache that result actually - so users searching again or navigating
>> through pages get the number of pages faster.
>
>> PS: For the OpenSearch-connector to not throw an exception but to return
>> the last page, please apply the patch I attached to the bug.
