I see, you mean grouping by host.
Yes, that works differently and is difficult.
If you like, you can switch off grouping by host.
Stefan


On 31.05.2006 at 00:10, Stefan Neufeind wrote:

Hi Stefan,

I didn't mean duplicate in the sense of "two times the same result",
but in the sense of "show only XX results per website", e.g. show at
most two pages from a website that might match. And you can't dedup
that before the search (at runtime) because you don't know what will
actually be searched. I'm referring to the hitsPerSite parameter of the
web interface; in the source it's named a bit more generally (there
are variables like dedupField etc.).


Regards,
 Stefan

Stefan Groschupf wrote:
Hi,
why not dedup your complete index beforehand instead of at runtime?
There is a dedup tool for that.

Stefan

On 29.05.2006 at 21:20, Stefan Neufeind wrote:

Hi Eugen,

What I've found (if I'm right) is that the page calculation is done in
Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
results when you only need the first page, I guess this is currently
not done. However, since I also needed the exact number, I did at least
find the "dirty hack". That helps for the moment.
But as it might take quite a while to determine the exact number of
pages, I suggest you compose a "hash" of the words searched for (and,
to be safe, maybe the number of non-dedupped search results) and cache
the exact count under it, so you don't have to compute that number
again and again when moving between pages.
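The caching idea above could be sketched like this. Note this is not Nutch code: `PageCountCache`, the key layout, and the `expensiveCount` supplier are all hypothetical, standing in for whatever the web interface already has at hand.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntSupplier;

// Sketch: cache the expensive exact page/result count, keyed by a hash of
// the query words plus the raw (non-dedupped) hit count, so moving between
// result pages doesn't recompute it. All names here are illustrative.
public class PageCountCache {

    private final Map<String, Integer> cache = new HashMap<>();

    // Build a cache key from the searched words and the raw hit count.
    static String key(List<String> queryWords, long rawHits) {
        return Integer.toHexString(String.join(" ", queryWords).hashCode())
                + ":" + rawHits;
    }

    // Return the cached exact count, or compute and remember it.
    int exactCount(List<String> queryWords, long rawHits,
                   IntSupplier expensiveCount) {
        return cache.computeIfAbsent(key(queryWords, rawHits),
                k -> expensiveCount.getAsInt());
    }
}
```

The raw hit count is included in the key as a cheap safeguard: if the index changes between requests, the key changes too and the stale cached count is not reused.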


Hope that helps,
 Stefan

Eugen Kochuev wrote:

And did you manage to locate the place where the filtering on a
per-site basis is done? Is it possible to tweak Nutch to make it tell
the exact number of pages after filtering, or is there a problem?

I've got a pending Nutch issue on this:
http://issues.apache.org/jira/browse/NUTCH-288

A dirty workaround (though it works) is to do a search with one hit per
page and a start index of 99999. That will give you the actual
start index of the last item; that index plus one is the number of
results you are looking for.
Since requesting the last page takes some resources, you might want to
cache that result, so users searching again or navigating through pages
get the number of pages faster.
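The arithmetic of that workaround can be sketched as follows. `search()` and `SearchResult` are hypothetical stand-ins that only simulate the clamping behaviour described above; they are not the real Nutch API.

```java
// Sketch of the "start index 99999" workaround for getting the exact
// deduplicated result count. SearchResult and search() are hypothetical
// stand-ins, simulating a search that clamps an out-of-range start index
// to the last available (deduplicated) result.
public class ExactCountWorkaround {

    // Hypothetical result holder: the start index actually served.
    record SearchResult(int actualStartIndex, int hitsReturned) {}

    // Hypothetical search: `totalDeduped` results exist after hitsPerSite
    // filtering; asking past the end clamps to the last item.
    static SearchResult search(int requestedStart, int hitsPerPage, int totalDeduped) {
        int start = Math.min(requestedStart, totalDeduped - 1);
        return new SearchResult(start, Math.min(hitsPerPage, totalDeduped - start));
    }

    // Request one hit at a huge start index; the clamped start index of the
    // last item, plus one, is the exact number of deduplicated results.
    static int exactResultCount(int totalDeduped) {
        SearchResult last = search(99999, 1, totalDeduped);
        return last.actualStartIndex() + 1;
    }

    public static void main(String[] args) {
        System.out.println(exactResultCount(1234)); // prints 1234
    }
}
```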

PS: For the OpenSearch connector not to throw an exception but to
return the last page, please apply the patch I attached to the issue.




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
