I know it's possible to switch it off. But I need it, and the question
was how to get the exact number of hits after "grouping". The only
thing I have found so far is this unclean workaround:
- request one hit per page
- go to page 99999
- see where we end up
- cache that number

Works but is ugly :-)
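
For illustration, the steps above can be sketched in Java roughly as follows. This is only a sketch under assumptions: GroupedSearch is a hypothetical stand-in for whatever search call is used and is not part of the Nutch API, and the cache key is simply the normalized query string rather than the hash discussed further down in the thread.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the search backend; NOT a Nutch interface. */
interface GroupedSearch {
    /**
     * Runs the query with grouping enabled and one hit per page, asking for
     * the hit at startIndex. Returns the 0-based index of the hit actually
     * returned (clamped to the last available one), or -1 if there are none.
     */
    int lastReachableIndex(String query, int startIndex);
}

/** Counts grouped hits by probing a very high start index, then caches. */
final class GroupedHitCounter {
    // "Page 99999" from the workaround; with more grouped hits than this,
    // the result is only a lower bound.
    private static final int PROBE_INDEX = 99999;

    private final GroupedSearch search;
    private final Map<String, Integer> cache = new HashMap<>();

    GroupedHitCounter(GroupedSearch search) {
        this.search = search;
    }

    /** Returns the number of hits after grouping, cached per query. */
    int count(String query) {
        // Assumption: trimming and lower-casing is good enough as a cache key.
        String key = query.trim().toLowerCase(Locale.ROOT);
        // Index of the last item + 1 is the count; -1 (no hits) becomes 0.
        return cache.computeIfAbsent(
                key, k -> search.lastReachableIndex(query, PROBE_INDEX) + 1);
    }
}
```

The expensive probe then runs only once per distinct query; paging through the results reuses the cached number. The cache here never expires, so it would need to be cleared whenever the index changes.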

  Stefan

Stefan Groschupf wrote:
> I see, you mean grouping by host.
> Yes, that works differently and is difficult.
> If you like, you can switch off grouping by host.
> Stefan
> 
> 
> On 31.05.2006 at 00:10, Stefan Neufeind wrote:
> 
>> Hi Stefan,
>>
>> I didn't mean duplicate in the sense of "two times the same result",
>> but in the sense of "show only XX results per website", e.g. showing
>> at most two pages of a website that might match. And you can't dedup
>> that before the search, only at runtime, because you don't know in
>> advance what will be searched for. I'm referring to the hitsPerSite
>> parameter of the web interface, while in the source it's named a bit
>> more generally (there are variables like dedupField etc.).
>>
>>
>> Regards,
>>  Stefan
>>
>> Stefan Groschupf wrote:
>>> Hi,
>>> why not dedup your complete index beforehand instead of at runtime?
>>> There is a dedup tool for that.
>>>
>>> Stefan
>>>
>>> On 29.05.2006 at 21:20, Stefan Neufeind wrote:
>>>
>>>> Hi Eugen,
>>>>
>>>> what I've found (if I'm right) is that the page calculation is done
>>>> in Lucene. As it is quite "expensive" (time-consuming) to dedup _all_
>>>> results when you only need the first page, I guess this is currently
>>>> not done. However, since I also needed the exact number, I at least
>>>> found the "dirty hack". That helps for the moment.
>>>> But as it might take quite a while to find out the exact number of
>>>> pages, I suggest that you cache it, e.g. keyed by a "hash" of the
>>>> words searched for and, to be sure, the number of non-deduped search
>>>> results, so you don't have to determine the exact number again and
>>>> again when moving between pages.
>>>>
>>>>
>>>> Hope that helps,
>>>>  Stefan
>>>>
>>>> Eugen Kochuev wrote:
>>>>>
>>>>> And did you manage to locate the place where the per-site
>>>>> filtering is done? Is it possible to tweak Nutch so that it reports
>>>>> the exact number of pages after filtering, or is there a problem?
>>>>>
>>>>>> I've got a pending Nutch issue on this:
>>>>>> http://issues.apache.org/jira/browse/NUTCH-288
>>>>>
>>>>>> A dirty (though working) workaround is to do a search with one hit
>>>>>> per page and a start index of 99999. That will give you the actual
>>>>>> start index of the last item; adding 1 to it gives the number of
>>>>>> results you are looking for.
>>>>>> Since requesting the last page takes some resources, you might want
>>>>>> to cache that result, so users searching again or navigating
>>>>>> through pages get the number of pages faster.
>>>>>
>>>>>> PS: For the OpenSearch-connector to not throw an exception but to
>>>>>> return the last page, please apply the patch I attached to the bug.

