Hi Stefan,

I didn't mean duplicate in the sense of "two times the same result",
but in the sense of "show only XX results per website", e.g. to show
at most two pages of a website that might match. And you can't dedup
that before the search (at runtime) because you don't know what will
actually be searched. I'm referring to the hitsPerSite parameter of
the webinterface - in the source it's named a bit more generally
(there are variables like dedupField etc.).
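
To illustrate what I mean, roughly (untested, and the API names are
from memory - please double-check against your Nutch version):

  import java.io.IOException;
  import org.apache.nutch.searcher.Hits;
  import org.apache.nutch.searcher.NutchBean;
  import org.apache.nutch.searcher.Query;

  public class HitsPerSiteDemo {
    public static void main(String[] args) throws IOException {
      NutchBean bean = new NutchBean();    // reads the searcher.dir config
      Query query = Query.parse("some words");
      // up to 20 hits, but at most 2 per site - that "2" is what the
      // webinterface calls hitsPerSite; internally the dedupping is more
      // general (dedupField, by default the site)
      Hits hits = bean.search(query, 20, 2);
      System.out.println("total hits: " + hits.getTotal());
    }
  }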


Regards,
 Stefan

Stefan Groschupf wrote:
> Hi,
> why not dedup your complete index beforehand instead of at runtime?
> There is a dedup tool for that.
> 
> Stefan
> 
> On 29.05.2006 at 21:20, Stefan Neufeind wrote:
> 
>> Hi Eugen,
>>
>> What I've found (if I'm right) is that the page-calculation is done
>> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
>> results when you only need the first page, I guess this is currently
>> not done. However, since I also needed the exact number, I at least
>> found a "dirty hack". That helps for the moment.
>> But as it might take quite a while to find out the exact number of
>> pages, I suggest you compose a "hash" of the words searched for (and,
>> to be safe, the number of non-dedupped search results) as a cache key,
>> so you don't have to determine the exact number again and again when
>> moving between pages.
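>>
>> Roughly what I have in mind (just a sketch, untested; the class and
>> method names are made up):
>>
>>   import java.util.HashMap;
>>   import java.util.Map;
>>
>>   // Cache the expensive exact (dedupped) total per query, keyed by the
>>   // search words plus the non-dedupped total, so moving between result
>>   // pages doesn't redo the expensive last-page search.
>>   public class ExactTotalCache {
>>     private final Map<String, Long> cache = new HashMap<String, Long>();
>>
>>     public long get(String queryWords, long rawTotal) {
>>       String key = queryWords + "|" + rawTotal;
>>       Long exact = cache.get(key);
>>       if (exact == null) {
>>         exact = computeExact(queryWords);  // the "dirty hack" below
>>         cache.put(key, exact);
>>       }
>>       return exact;
>>     }
>>
>>     // placeholder: do the one-hit-per-page, start-index 99999 search here
>>     private long computeExact(String queryWords) {
>>       return 0L;
>>     }
>>   }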
>>
>>
>> Hope that helps,
>>  Stefan
>>
>> Eugen Kochuev wrote:
>>>
>>> And did you manage to locate the place where the filtering on a
>>> per-site basis is done? Is it possible to tweak Nutch to make it
>>> tell the exact number of pages after filtering, or is there a
>>> problem?
>>>
>>>> I've got a pending Nutch issue on this:
>>>> http://issues.apache.org/jira/browse/NUTCH-288
>>>
>>>> A dirty workaround (though working) is to do a search with one hit
>>>> per page and a start-index of 99999. That will give you the actual
>>>> start-index of the last item, which plus 1 is the number of results
>>>> you are looking for. Since requesting the last page takes a bit of
>>>> resources, you might want to cache that result - so users searching
>>>> again or navigating through pages get the number of pages faster.
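>>>>
>>>> For example (a sketch - the servlet path and parameter names are
>>>> from memory, please check them against your installation):
>>>>
>>>>   http://localhost:8080/opensearch?query=foo&hitsPerPage=1&start=99999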
>>>
>>>> PS: To make the OpenSearch connector return the last page instead
>>>> of throwing an exception, please apply the patch I attached to the
>>>> bug.
