Dear Piotr,
These pages are not identical. They have different links and, of
course, different advertisements.
I use your great patch for nutch-7 ;), which removes identical pages.
I am waiting for your new patch (www.cnn.com, cnn.com), because it will
solve 90% of these problems. I don't think there is any other way to
solve the nutch-70 problem.
I don't think the loss of anchor text is a problem.
Thanks for your great patches, Ferenc
Piotr Kosiorowski wrote:
Hello Ferenc,
If the pages are really identical they can be removed using the "nutch dedup"
command. If not (sometimes such pages differ by some date, counter or
advertisement), there is currently no tool that makes it
possible to remove them. I am working on a simple tool to remove
duplicates like
http://www.cnn.com/ and http://cnn.com (that differ only in "www"), but
at this stage it is rather a hack: it removes the page from the Lucene index,
but all anchor text for the removed page is lost and the WebDB is not updated.
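
To illustrate the "www" case described above, here is a minimal Python
sketch. It is not the tool mentioned in this message and uses no Nutch
code. The names host_key and group_www_duplicates are hypothetical. It
assumes that two URLs are candidates for merging when they differ only by
a leading "www." in the host name, and it ignores scheme and port. It only
finds candidate groups; it does not touch the Lucene index, the anchor
text or the WebDB.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    """Return a comparison key that ignores a leading "www." in the host.

    Assumption: scheme and port are ignored, and an empty path is
    treated as "/". These choices are illustrative only.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    key = host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def group_www_duplicates(urls):
    """Group URLs that share a key; return only groups with 2+ URLs."""
    groups = defaultdict(list)
    for url in urls:
        groups[host_key(url)].append(url)
    return [group for group in groups.values() if len(group) > 1]
```

Calling group_www_duplicates on a list containing http://www.cnn.com/ and
http://cnn.com puts both URLs in one group. Which URL of a group to keep,
and what to do with its anchor text, is left open here.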
Regards
Piotr
Lutischán Ferenc (JIRA) wrote:
duplicate pages - virtual hosts in db.
--------------------------------------
Key: NUTCH-70
URL: http://issues.apache.org/jira/browse/NUTCH-70
Project: Nutch
Type: Bug
Environment: 0,7 dev
Reporter: Lutischán Ferenc
Dear Developers,
I have a problem with nutch:
- Many sites are duplicated in the webdb and in the segments.
The source of this problem is:
- If a site uses 'virtual hosts' (as in Apache), e.g. www.origo.hu,
origo.hu, origo.matav.hu, origo.matavnet.hu etc., the resulting pages
are the same; only the inlinks are different.
- The ip address is the same.
- When searching, all virtual hosts appear in the results.
Google shows only one of these virtual hosts; Nutch shows all of them. The
resulting Nutch db is larger, and in this case slower, than Google.
Do you have any idea how to remove these duplicates?
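
One possible way to spot virtual hosts that serve almost the same page is
to compare the page texts. The sketch below is in Python. It is only an
illustration, not something Nutch provides. The function names, the
shingle size of 3 words and the threshold of 0.8 are all assumptions. It
compares pages by the Jaccard similarity of their word shingles, so small
differences such as an advertisement or a counter do not prevent a match.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_hosts(pages, threshold=0.8):
    """pages maps host name -> page text (hypothetical input format).

    Returns pairs of hosts whose texts reach the similarity threshold.
    This compares every pair, so it only suits small sets of hosts.
    """
    hosts = sorted(pages)
    sets = {host: shingles(pages[host]) for host in hosts}
    pairs = []
    for i, first in enumerate(hosts):
        for second in hosts[i + 1:]:
            if jaccard(sets[first], sets[second]) >= threshold:
                pairs.append((first, second))
    return pairs
```

Given the text fetched from www.origo.hu and origo.hu, a pair would be
reported when the two texts share at least 80% of their shingles. The
right threshold depends on how much the pages really differ.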
Regards,
Ferenc
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers