Hi, I haven't used it myself, but it looks like the *dedup* command (http://wiki.apache.org/nutch/bin/nutch_dedup) uses the signature of the documents to remove duplicates. That should work fine in the case you are describing, in combination with Jasper's suggestion, which would prevent fetching some of the duplicates in the first place.
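Since the pages behind the different URLs are identical, the default MD5 signature should already give them the same fingerprint. If they turn out to differ in small details (timestamps, ads, etc.), you could try switching the signature implementation in your nutch-site.xml. A sketch, untested on my side (the property and both classes are from nutch-default.xml, but check your version):

  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
    <description>Compute the signature from a profile of the page text
    instead of an exact MD5 (org.apache.nutch.crawl.MD5Signature), so
    pages with the same content get the same signature regardless of
    the URL they were fetched from.</description>
  </property>

Then, after indexing, running something like "bin/nutch dedup crawl/indexes" should keep only one document per signature ("crawl/indexes" is just an example path, adjust it to your own crawl directory layout).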
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

> On Oct 8, 2008, at 4:23 AM, Detlef Müller-Solger wrote:
>
>> Hi,
>>
>> in Germany it is reported that one big show-stopper for Nutch is the
>> fact that there are often identical webpages which can be addressed by
>> different URLs, for example by requesting
>>
>> www.xyz.de/information
>> or
>> www.xyz.de/information/
>> or
>> www.xyz.de/information/index
>>
>> From my point of view, due to the different URLs Nutch unfortunately
>> indexes those webpages three times. Is there a method to avoid indexing
>> these duplicates? For example, by comparing all information of the
>> webpage excluding the URL.
>>
>> Note: a filter like "generally strip '/index' from the URL" is no
>> solution, because in other cases in the same run "/index" may be
>> needed, or the same webpage can also be addressed by other URL syntax.
>>
>> Thanks
>>
>> Detlef Müller-Solger
