In search.jsp lines 116-119:

  int hitsPerSite = 2;                            // max hits per site
  String hitsPerSiteString = request.getParameter("hitsPerSite");
  if (hitsPerSiteString != null)
    hitsPerSite = Integer.parseInt(hitsPerSiteString);
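
Because the value is read with request.getParameter("hitsPerSite"), it can be overridden per request by adding hitsPerSite=1 to the search URL's query string. Note that Integer.parseInt throws NumberFormatException on non-numeric input. A minimal sketch of a more forgiving parse follows; the class and method names are hypothetical and not part of search.jsp, and the default of 2 is taken from the snippet above:

```java
// Hypothetical helper, not part of Nutch: parses the hitsPerSite request
// parameter and falls back to the default when it is missing or malformed.
class HitsPerSiteParam {
    // Default taken from the search.jsp snippet quoted above.
    static final int DEFAULT_HITS_PER_SITE = 2;

    static int parseHitsPerSite(String raw) {
        if (raw == null) {
            return DEFAULT_HITS_PER_SITE;   // parameter not supplied
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_HITS_PER_SITE;   // e.g. hitsPerSite=abc
        }
    }

    public static void main(String[] args) {
        System.out.println(parseHitsPerSite("1"));
        System.out.println(parseHitsPerSite(null));
        System.out.println(parseHitsPerSite("abc"));
    }
}
```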

Hope that helps.

Dennis

vishal vachhani wrote:
Dennis,
            I am facing the same problem: in my crawl, the content of some
URLs is the same but the URLs are different. Could you please tell me how I
can set hitsPerSite to 1?

--Vishal

On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:

If you are using more than one index then dedup will not work across
indexes.  A single index should dedup correctly unless the pages are not
exact duplicates but near duplicates.  The dedup process works on URL and
byte hash.  If the content differs by even 1 byte, the pages are not
treated as duplicates.
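
To illustrate why a single differing byte defeats hash-based dedup, here is a minimal sketch that assumes an MD5 digest over the raw page bytes. The class and method names are hypothetical, and this is not Nutch's actual dedup code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of exact-duplicate detection by byte hash.
class ByteHashDemo {
    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Two pages count as duplicates only if their digests match exactly.
    static boolean sameContent(byte[] a, byte[] b) {
        return md5Hex(a).equals(md5Hex(b));
    }

    public static void main(String[] args) {
        byte[] page1 = "TeraTerm Pro".getBytes(StandardCharsets.UTF_8);
        byte[] page2 = "TeraTerm Pro".getBytes(StandardCharsets.UTF_8);
        byte[] page3 = "TeraTerm Pro ".getBytes(StandardCharsets.UTF_8); // one extra byte
        System.out.println(sameContent(page1, page2)); // identical bytes
        System.out.println(sameContent(page1, page3)); // differs by one byte
    }
}
```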

Near-duplicate detection is another set of algorithms that hasn't been
implemented in Nutch yet.  On the query side you can set the hitsPerSite to
1 and it should limit your search results.
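
As a rough sketch of what a per-site cap does to a result list, the following keeps at most hitsPerSite URLs per host. The names are hypothetical and this is not how Nutch implements it internally; treating a value of 0 or less as "no limit" is an assumption of this sketch:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of capping search hits per site (host).
class PerSiteLimiter {
    static List<String> limitPerSite(List<String> urls, int hitsPerSite) {
        if (hitsPerSite <= 0) {
            return new ArrayList<>(urls);   // assumption: non-positive means no cap
        }
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            String key = (host != null) ? host : url;   // fall back to the full URL
            int count = seenPerHost.merge(key, 1, Integer::sum);
            if (count <= hitsPerSite) {
                kept.add(url);              // still under the cap for this host
            }
        }
        return kept;
    }
}
```

With hitsPerSite set to 1, two results from the same host collapse to the first one, whether or not their content is byte-identical.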

Dennis


Edward Quick wrote:

Hi,

Even though I ran nutch dedup on my index, I still have pages with
different URLs but exactly the same content (see the search result example
below). From what I have read about dedup, this shouldn't happen, as it
deletes the URL with the lowest score. Is there anything else I can try to
get rid of these?

Thanks,
Ed.

Item Document :- Client - TeraTerm Pro
... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
Online   Employee Self Service       ESS Home ... Description Document
Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
printing or keymapping is an issue, TeraTerm ...

http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
 (cached) (explain) (anchors)



Item Document :- Client - TeraTerm Pro
... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
Online   Employee Self Service       ESS Home ... Description Document
Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
printing or keymapping is an issue, TeraTerm ...

http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
 (cached) (explain) (anchors)

