> Date: Thu, 25 Sep 2008 21:10:52 +0530
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: pages with duplicate content in search results
> 
> Dennis,
>             I am facing the same problem: in my crawl, the content of some
> URLs is the same but the URLs are different. Could you please tell me how I
> can set hitsPerSite to 1?

I changed hitsPerSite to 0 in the search.jsp (to get rid of the 'show all hits' 
button). It might be possible to set this in the web.xml or nutch-site.xml 
though?
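
To make the effect of that setting concrete, here is a rough sketch of a per-site limit applied to a ranked list of result URLs. This is not Nutch's code: the function name `limit_hits_per_site` is made up for illustration, grouping by host name is an assumption, and treating 0 as "no limit" is an assumption based on the behaviour I saw in search.jsp.

```python
from collections import defaultdict
from urllib.parse import urlparse


def limit_hits_per_site(urls, hits_per_site):
    """Keep at most hits_per_site results per host, preserving rank order.

    Hypothetical helper, not part of Nutch. A value of 0 or less is
    treated as "no limit" (an assumption, see the text above).
    """
    if hits_per_site <= 0:
        return list(urls)

    seen = defaultdict(int)  # host -> number of hits already kept
    kept = []
    for url in urls:
        host = urlparse(url).netloc
        if seen[host] < hits_per_site:
            kept.append(url)
            seen[host] += 1
    return kept
```

Under these assumptions, a limit of 1 on a crawl of a single host would keep only the top hit, so it hides the duplicates but also every other page from that host.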

> 
> --Vishal
> 
> On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> 
> > If you are using more than one index then dedup will not work across
> > indexes.  A single index should dedup correctly unless the pages are not
> > exact duplicates but near duplicates.  The dedup process works on url and
> > byte hash.  If the content is even 1 byte different, it doesn't work.


I only have one index, and have only crawled one site, which is the 
intranet at my work. 
The pages definitely seem to be identical. I saved the source from both pages 
and the sizes were exactly the same too.
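
Equal sizes alone do not prove the bytes are equal, so a stricter check is to hash both saved copies and compare the digests. A minimal sketch, assuming MD5 as the byte hash (which hash dedup really uses is an assumption here) and with made-up helper names:

```python
import hashlib


def byte_hash(data: bytes) -> str:
    """Return a hex digest of the raw bytes (MD5 is an assumption)."""
    return hashlib.md5(data).hexdigest()


def same_content(a: bytes, b: bytes) -> bool:
    """True only if both byte strings hash to the same digest."""
    return byte_hash(a) == byte_hash(b)


def files_identical(path_a: str, path_b: str) -> bool:
    """Compare two saved pages on disk by the hash of their bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return same_content(fa.read(), fb.read())
```

Two pages of the same length that differ in a single byte give different digests, which matches the "even 1 byte different" case Dennis describes.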


> >
> > Near duplicate detection is another set of algorithms that haven't been
> > implemented in Nutch yet.  On the query side you can set the hitsPerSite to
> > 1 and it should limit your search results.
> >
> > Dennis
> >
> >
> > Edward Quick wrote:
> >
> >> Hi,
> >>
> >> Even though I ran nutch dedup on my index, I still have pages with
> >> different urls but exactly the same content (see search result example
> >> below). From what I read up on dedup this shouldn't happen though as it
> >> deletes the url with the lowest score. Is there anything else I can try to
> >> get rid of these?
> >>
> >> Thanks,
> >> Ed.
> >>
> >> Item Document :- Client - TeraTerm Pro
> >> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
> >> Online   Employee Self Service       ESS Home ... Description Document
> >> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
> >> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
> >> printing or keymapping is an issue, TeraTerm ...
> >>
> >> http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
> >> (cached) (explain) (anchors)
> >>
> >>
> >>
> >> Item Document :- Client - TeraTerm Pro
> >> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
> >> Online   Employee Self Service       ESS Home ... Description Document
> >> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
> >> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
> >> printing or keymapping is an issue, TeraTerm ...
> >>
> >> http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
> >> (cached) (explain) (anchors)
> >>
> >
