Hi Stefan,

I didn't mean duplicate in the sense of "two times the same result", but
in the sense of "show only XX results per website", e.g. to show at most
two pages of a website that might match. And you can't dedup that before
the search, only at runtime, because you don't know in advance what was
actually searched. I'm referring to the hitsPerSite parameter of the web
interface - in the source it's named a bit more generally (there are
variables like dedupField etc.).
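To make it concrete, here's a rough sketch of the kind of runtime dedup
I mean: walk the ranked results and keep at most N hits per site. This
is only illustrative Java, not Nutch's actual code - the Hit class and
getSite() are made up for the example:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only - not Nutch's actual code. Hit/getSite() are
    // hypothetical stand-ins for whatever the searcher returns.
    class Hit {
        private final String site;
        private final String url;
        Hit(String site, String url) { this.site = site; this.url = url; }
        String getSite() { return site; }
    }

    class SiteDedup {
        // Keep at most maxHitsPerSite hits per site, preserving rank order.
        static List<Hit> dedupBySite(List<Hit> rankedHits, int maxHitsPerSite) {
            Map<String, Integer> perSite = new HashMap<String, Integer>();
            List<Hit> kept = new ArrayList<Hit>();
            for (Hit hit : rankedHits) {
                Integer seen = perSite.get(hit.getSite());
                int count = (seen == null) ? 0 : seen.intValue();
                if (count < maxHitsPerSite) {
                    kept.add(hit);
                    perSite.put(hit.getSite(), count + 1);
                }
            }
            return kept;
        }
    }

Since which hits collapse together depends entirely on which documents
matched the query, this pass can only run over the actual result list -
which is exactly why an index-time dedup tool doesn't cover it.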
Regards,
 Stefan


Stefan Groschupf wrote:
> Hi,
> why not dedup your complete index beforehand and not at runtime?
> There is a dedup tool for that.
>
> Stefan
>
> On 29.05.2006 at 21:20, Stefan Neufeind wrote:
>
>> Hi Eugen,
>>
>> what I've found (and if I'm right) is that the page calculation is done
>> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
>> results when you only need the first page, I guess this is currently
>> not done. However, since I also needed the exact number, I did at least
>> find the "dirty hack". That helps for the moment.
>> But as it might take quite a while to find out the exact number of
>> pages, I suggest that you e.g. compose a "hash" of the words searched
>> for, and maybe, to be safe, the number of non-dedupped search results,
>> so you don't have to look up the exact number again and again when
>> moving between pages.
>>
>> Hope that helps,
>>  Stefan
>>
>> Eugen Kochuev wrote:
>>>
>>> And did you manage to locate the place where the filtering on a
>>> per-site basis is done? Is it possible to tweak Nutch to make it tell
>>> the exact number of pages after filtering, or is there a problem?
>>>
>>>> I've got a pending Nutch issue on this:
>>>> http://issues.apache.org/jira/browse/NUTCH-288
>>>
>>>> A dirty workaround (though working) is to do a search with one hit
>>>> per page and a start index of 99999. That will give you the actual
>>>> start index of the last item, which +1 is the number of results you
>>>> are looking for. Since requesting the last page takes some resources,
>>>> you might want to cache that result - so users searching again or
>>>> navigating through pages get the number of pages faster.
>>>
>>>> PS: For the OpenSearch connector to not throw an exception but to
>>>> return the last page, please apply the patch I attached to the bug.
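For completeness, here is a rough sketch of the workaround plus the
caching idea from above: cache the expensive "exact total" lookup, keyed
on a hash of the query together with the raw, non-dedupped hit count (so
a changed index doesn't serve a stale total). SearchService here is a
hypothetical stand-in for however you actually query Nutch (e.g. via the
OpenSearch connector) - only the start-index-99999 trick itself is from
the thread:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only; SearchService is a hypothetical stand-in for the
    // real search call, not an actual Nutch API.
    interface SearchService {
        // Start index of the last reachable hit when requesting one hit
        // per page at an oversized start index (the "99999 trick").
        int lastHitIndex(String query, int startIndex);
        int rawHitCount(String query); // total before per-site dedup
    }

    class ExactCountCache {
        private final SearchService search;
        private final Map<Integer, Integer> cache =
                new HashMap<Integer, Integer>();

        ExactCountCache(SearchService search) { this.search = search; }

        synchronized int exactDedupedCount(String query) {
            // Key on the query plus the raw hit count, so a changed
            // index (different raw count) invalidates the entry.
            int key = (query + "#" + search.rawHitCount(query)).hashCode();
            Integer cached = cache.get(key);
            if (cached == null) {
                cached = search.lastHitIndex(query, 99999) + 1;
                cache.put(key, cached);
            }
            return cached;
        }
    }

That way the costly last-page request runs once per query, and paging
back and forth through the results reuses the cached count.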
