I know it's possible to switch it off. But I need it, and the question was how to get the exact number of hits after "grouping". The unclean workaround is the only thing I have found so far:
- request one hit per page
- go to page 99999
- see where we end up
- cache that number
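For illustration, here is a minimal sketch of those steps in Java. It is not Nutch code: `SearchBackend`, `actualStartIndex` and `rawTotal` are hypothetical stand-ins for whatever search call you really use, and I assume the backend returns the last page (not an exception) when the requested start index is beyond the end. The cache key is simply the normalized query plus the non-deduped total, used directly instead of a hash of it.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DedupedHitCounter {

    /** Hypothetical stand-in for the real search call; NOT a Nutch API. */
    public interface SearchBackend {
        /**
         * Runs the query with per-site grouping enabled and returns the start
         * index of the page actually served when requestedStart is asked for.
         * Assumed to clamp to the last hit, and to return -1 if nothing matches.
         */
        int actualStartIndex(String query, int hitsPerPage, int requestedStart);

        /** Total number of hits before grouping (assumed cheap to obtain). */
        long rawTotal(String query);
    }

    // Start index assumed to be far beyond the last grouped hit.
    private static final int FAR_BEYOND_END = 99999;

    private final SearchBackend backend;
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();

    public DedupedHitCounter(SearchBackend backend) {
        this.backend = backend;
    }

    /** Number of hits after grouping, computed once per query and then cached. */
    public int exactHitCount(String query) {
        // Key: normalized search words plus the non-deduped total, so a changed
        // index (different raw total) does not reuse a stale number.
        String key = query.trim().toLowerCase(Locale.ROOT) + "|" + backend.rawTotal(query);
        return cache.computeIfAbsent(key, k -> {
            // One hit per page, jump far past the end, see where we end up.
            int lastStart = backend.actualStartIndex(query, 1, FAR_BEYOND_END);
            // Start indexes are zero-based, so the count is last index + 1.
            return lastStart + 1;
        });
    }
}
```

Note that with more than 100000 grouped hits this only gives a lower bound, because start index 99999 would then be a valid position and not the last one.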
Works but is ugly :-)

Stefan

Stefan Groschupf wrote:
> I see, you mean grouping by host.
> Yes, that works differently and is difficult.
> If you like, you can switch off grouping by host.
> Stefan
>
>
> On 31.05.2006 at 00:10, Stefan Neufeind wrote:
>
>> Hi Stefan,
>>
>> I didn't mean duplicate in the sense of "two times the same result" -
>> but in the sense of "show only XX results per website", e.g. only to
>> show max two pages of a website that might match. And you can't dedup
>> that before the search (runtime) because you don't know what was
>> actually searched. I'm referring to the hitsPerSite-parameter of the
>> webinterface - while in the source it's called a bit more general (there
>> are variables like dedupField etc.).
>>
>>
>> Regards,
>> Stefan
>>
>> Stefan Groschupf wrote:
>>> Hi,
>>> why not dedup your complete index beforehand rather than at runtime?
>>> There is a dedup tool for that.
>>>
>>> Stefan
>>>
>>> On 29.05.2006 at 21:20, Stefan Neufeind wrote:
>>>
>>>> Hi Eugen,
>>>>
>>>> what I've found (and if I'm right) is that the page-calculation is done
>>>> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
>>>> results when you only need the first page, I guess this is currently
>>>> not done. However, since I also needed the exact number, I did find
>>>> out the "dirty hack" at least. That helps for the moment.
>>>> But as it might take quite a while to find out the exact number of
>>>> pages, I suggest that e.g. you compose a "hash" of the words searched
>>>> for, and maybe, to be sure, the number of non-dedupped search results,
>>>> so you don't have to search the exact number again and again when
>>>> moving between pages.
>>>>
>>>>
>>>> Hope that helps,
>>>> Stefan
>>>>
>>>> Eugen Kochuev wrote:
>>>>>
>>>>> And did you manage to locate the place where the filtering on a
>>>>> per-site basis is done? Is it possible to tweak nutch to make it tell
>>>>> the exact number of pages after filtering or is there a problem?
>>>>>
>>>>>> I've got a pending nutch-issue on this
>>>>>> http://issues.apache.org/jira/browse/NUTCH-288
>>>>>
>>>>>> A dirty workaround (though working) is to do a search with one hit
>>>>>> per page and start-index as 99999. That will give you the actual
>>>>>> start-index of the last item, which +1 is the number of results you
>>>>>> are looking for.
>>>>>> Since requesting the last page takes a bit of resources, you might
>>>>>> want to cache that result - so users searching again or navigating
>>>>>> through pages get the number of pages faster.
>>>>>
>>>>>> PS: For the OpenSearch-connector to not throw an exception but to
>>>>>> return the last page, please apply the patch I attached to the bug.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
