[ 
http://issues.apache.org/jira/browse/NUTCH-70?page=comments#action_12413169 ] 

Stefan Neufeind commented on NUTCH-70:
--------------------------------------

Is the content exactly the same? Maybe each page could be checked against 
already existing ones by an MD5 hash of its content? But I'm not sure there is a 
clean way to work around the problem: what if all pages are the same except 
one on the other vhost? You would have to crawl them all anyway, wouldn't you?
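
To make the MD5 idea concrete, here is a minimal sketch in Python. It is not 
Nutch code: the function names and the page bodies are made up for 
illustration, and the path only-here.html is hypothetical. Only the host names 
come from the report. It groups already-fetched pages by the MD5 digest of 
their content, which also shows the limitation above: every page has to be 
fetched before it can be compared.

```python
import hashlib
from collections import defaultdict


def content_digest(content: bytes) -> str:
    """Return the hex MD5 digest of a page's raw content."""
    return hashlib.md5(content).hexdigest()


def group_by_digest(pages):
    """Group URLs by content digest.

    pages is an iterable of (url, content) pairs, where content is the
    already-fetched page body as bytes.
    """
    groups = defaultdict(list)
    for url, content in pages:
        groups[content_digest(content)].append(url)
    return groups


# Hypothetical sample data: three vhosts serve the same front page,
# and one vhost has a page that exists nowhere else.
pages = [
    ("http://www.origo.hu/", b"<html>front page</html>"),
    ("http://origo.hu/", b"<html>front page</html>"),
    ("http://origo.matav.hu/", b"<html>front page</html>"),
    ("http://origo.matav.hu/only-here.html", b"<html>unique page</html>"),
]

groups = group_by_digest(pages)

# Digests shared by more than one URL mark exact duplicates.
duplicates = {d: urls for d, urls in groups.items() if len(urls) > 1}

for digest, urls in duplicates.items():
    print(digest, urls)
```

This only catches byte-identical content; pages that differ by a timestamp or 
a host name in a link would get different digests.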

> duplicate pages - virtual hosts in db.
> --------------------------------------
>
>          Key: NUTCH-70
>          URL: http://issues.apache.org/jira/browse/NUTCH-70
>      Project: Nutch
>         Type: Bug

>  Environment: 0.7 dev
>     Reporter: YourSoft

>
> Dear Developers,
> I have a problem with Nutch:
> - There are many duplicate sites in the webdb and in the segments.
> The source of this problem is:
> - If a site uses 'virtual hosts' (as in Apache), e.g. www.origo.hu, 
> origo.hu, origo.matav.hu, origo.matavnet.hu etc., the resulting pages are 
> the same; only the inlinks differ.
> - The IP address is the same.
> - When searching, all virtual hosts appear in the results.
> Google shows only one of these virtual hosts; Nutch shows all of them. The 
> resulting Nutch db is larger, and in this case slower, than Google's.
> Do you have any idea how to remove these duplicates?
> Regards,
>     Ferenc

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
