Sorry for the off-topic question, but how do you make Nutch-0.9 search multiple indexes?

On Thu, Sep 25, 2008 at 4:42 PM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> If you are using more than one index then dedup will not work across
> indexes. A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates. The dedup process works on URL and
> byte hash. If the content is even 1 byte different, it doesn't work.
>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet. On the query side you can set the hitsPerSite to
> 1 and it should limit your search results.
>
> Dennis
>
> Edward Quick wrote:
>>
>> Hi,
>>
>> Even though I ran nutch dedup on my index, I still have pages with
>> different URLs but exactly the same content (see search result example
>> below). From what I read up on dedup this shouldn't happen though, as it
>> deletes the URL with the lowest score. Is there anything else I can try to
>> get rid of these?
>>
>> Thanks,
>> Ed.
>>
>> Item Document :- Client - TeraTerm Pro
>> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
>> Online Employee Self Service ESS Home ... Description Document
>> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
>> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
>> printing or keymapping is an issue, TeraTerm ...
>>
>> http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
>> (cached) (explain) (anchors)
>>
>>
>>
>> Item Document :- Client - TeraTerm Pro
>> ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
>> Online Employee Self Service ESS Home ... Description Document
>> Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
>> Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
>> printing or keymapping is an issue, TeraTerm ...
>>
>> http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
>> (cached) (explain) (anchors)
>

--
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]
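
As a side note on Dennis's point that dedup works on a byte hash: the rough sketch below shows why two pages that look identical in the search results can still survive dedup. It is only an illustration, not Nutch's actual code. The choice of MD5, the function names, and the example.com URLs and page bodies are all assumptions made up for the example.

```python
import hashlib
from collections import defaultdict


def content_hash(content: bytes) -> str:
    """Hash the raw page bytes (MD5 is an assumption for this sketch)."""
    return hashlib.md5(content).hexdigest()


def group_by_hash(pages: dict) -> dict:
    """Group URLs by the hash of their content (hypothetical helper)."""
    groups = defaultdict(list)
    for url, content in pages.items():
        groups[content_hash(content)].append(url)
    return dict(groups)


# Made-up pages: the first two are byte-identical, the third differs
# by a single trailing space inside the markup.
pages = {
    "http://example.com/view-a/doc": b"<html>TeraTerm Pro</html>",
    "http://example.com/view-b/doc": b"<html>TeraTerm Pro</html>",
    "http://example.com/view-c/doc": b"<html>TeraTerm Pro </html>",
}

groups = group_by_hash(pages)

# Only URLs that share a hash are exact duplicates and could be collapsed.
duplicates = [urls for urls in groups.values() if len(urls) > 1]
print(duplicates)
```

In this sketch only view-a and view-b end up in the same group. The view-c page hashes differently even though it would render the same, which is the near-duplicate case that an exact byte hash cannot catch.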
