> > Dennis,
> > I am facing the same problem: in my crawl the content of some urls is the
> > same, but the urls are different. Could you please tell me how I can set
> > hitsPerSite to 1?
>
> I changed hitsPerSite to 0 in the search.jsp (to get rid of the 'show all
> hits' button). It might be possible to set this in the web.xml or
> nutch-site.xml though?
>
> > --Vishal
> >
> > On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> > > If you are using more than one index then dedup will not work across
> > > indexes. A single index should dedup correctly unless the pages are not
> > > exact duplicates but near duplicates. The dedup process works on url and
> > > byte hash. If the content is even 1 byte different, it doesn't work.
>
> I only have one index, and have only crawled one domain site, which is the
> Intranet at my work.
> The pages definitely seem to be identical. I saved the source from both pages
> and the sizes were exactly the same too.

Also, just to add to this, I checked the index with Luke, which shows the two
urls below with the same titles but different timestamps, digests and boosts. :-(

> > > Near duplicate detection is another set of algorithms that haven't been
> > > implemented in Nutch yet. On the query site you can set the hitsPerSite
> > > to 1 and it should limit your search results.
> > >
> > > Dennis
> > >
> > > Edward Quick wrote:
> > >
> > >> Hi,
> > >>
> > >> Even though I ran nutch dedup on my index, I still have pages with
> > >> different urls but exactly the same content (see search result example
> > >> below). From what I read up on dedup this shouldn't happen though, as it
> > >> deletes the url with the lowest score. Is there anything else I can try
> > >> to get rid of these?
> > >>
> > >> Thanks,
> > >> Ed.
> > >>
> > >> Item Document :- Client - TeraTerm Pro
> > >> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
> > >> Online Employee Self Service ESS Home ... Description Document
> > >> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
> > >> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
> > >> printing or keymapping is an issue, TeraTerm ...
> > >>
> > >> http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
> > >> (cached) (explain) (anchors)
> > >>
> > >> Item Document :- Client - TeraTerm Pro
> > >> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
> > >> Online Employee Self Service ESS Home ... Description Document
> > >> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
> > >> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
> > >> printing or keymapping is an issue, TeraTerm ...
> > >>
> > >> http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
> > >> (cached) (explain) (anchors)
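
Two observations in this thread fit together: dedup keys on a byte hash, and Luke shows different digests for two pages whose saved sources have exactly the same size. The Python sketch below illustrates how that can happen. It assumes the digest is an MD5 over the raw page bytes, which is an assumption for illustration and not a statement about how this particular index was built. The page bodies and the helper names (content_digest, first_difference) are made up; the idea is only that a page embedding its own view-specific path can be the same length as its twin and still hash differently.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Hex MD5 of the raw bytes (assumed stand-in for the digest seen in Luke)."""
    return hashlib.md5(content).hexdigest()


def first_difference(a: bytes, b: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if a == b."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One input is a prefix of the other.
        return min(len(a), len(b))
    return -1


# Hypothetical page bodies: identical except for a view id inside a link,
# so both have exactly the same length.
page_a = b'<a href="/technica.nsf/8918e269/441cdf92?OpenDocument">TeraTerm Pro</a>'
page_b = b'<a href="/technica.nsf/dacff06c/441cdf92?OpenDocument">TeraTerm Pro</a>'

if __name__ == "__main__":
    print("same size:    ", len(page_a) == len(page_b))
    print("same digest:  ", content_digest(page_a) == content_digest(page_b))
    print("first diff at:", first_difference(page_a, page_b))
```

Running the same comparison on the two saved page sources (read in binary mode) would show whether they really are byte-identical, and if not, where they first diverge.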
