[
https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925728#action_12925728
]
Marco Novo commented on NUTCH-926:
----------------------------------
I'm sorry, I did not mean to shout; I know you are able to hear me. I was only
desperate, and the capital letters were meant to make the problem more
visible. :)
> From what I understand, the problem was already known: we should add another
> property (and probably a plugin) to regulate the crawling of web load balancers
> that have a different hostname than the original but contain relevant data.
But without our patch, an unfortunate redirect outside of the domain (with no
load balancer involved) could lead Nutch to download the entire web when a
high crawl depth is used.
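A check along these lines could be sketched as follows. This is only an
illustration in plain Java: it is not the attached patch and it does not use
any Nutch API. The class name RedirectHostFilter and its methods are
hypothetical, and it assumes that "same host" means an exact, case-insensitive
hostname match, so a redirect from WWW.RIGHTDOMAIN.COM to WRONG.RIGHTDOMAIN.COM
counts as external.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether a redirect target
// stays on the same host as the page that issued the redirect.
final class RedirectHostFilter {

    // Returns the lower-cased host of the URL, or null if it cannot be parsed
    // or has no host component.
    static String hostOf(String url) {
        try {
            String host = new URI(url.trim()).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // True only when both URLs have a host and the hosts are identical
    // (ignoring case). Subdomains are treated as different hosts.
    static boolean isSameHost(String fromUrl, String toUrl) {
        String from = hostOf(fromUrl);
        String to = hostOf(toUrl);
        return from != null && from.equals(to);
    }

    public static void main(String[] args) {
        String page = "http://WWW.RIGHTDOMAIN.COM/index.html";
        // The two redirect targets from the issue description, plus an
        // internal one for comparison.
        System.out.println(isSameHost(page, "http://WRONG.RIGHTDOMAIN.COM"));     // false
        System.out.println(isSameHost(page, "http://WWW.WRONGDOMAIN.COM"));       // false
        System.out.println(isSameHost(page, "http://www.rightdomain.com/other")); // true
    }
}
```

With a check like this, a redirect would be followed only when isSameHost
returns true; unparseable targets are rejected rather than followed.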
> Nutch follows wrong url in <META http-equiv="refresh" tag
> ---------------------------------------------------------
>
> Key: NUTCH-926
> URL: https://issues.apache.org/jira/browse/NUTCH-926
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.2
> Environment: gnu/linux centOs
> Reporter: Marco Novo
> Priority: Critical
> Fix For: 1.3
>
> Attachments: ParseOutputFormat.java.patch
>
>
> We have Nutch set to crawl a list of domain URLs, and we want to fetch only
> the listed domains (hosts), not their subdomains.
> So
> WWW.DOMAIN1.COM
> ..
> ..
> ..
> WWW.RIGHTDOMAIN.COM
> ..
> ..
> ..
> ..
> WWW.DOMAIN.COM
> We set Nutch to:
> NOT FOLLOW EXTERNAL LINKS
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
> <META http-equiv="refresh" content="0;
> url=http://WRONG.RIGHTDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG subdomains! But it should not do this!!
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
> <META http-equiv="refresh" content="0;
> url=http://WWW.WRONGDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG domain! But it should not do this!
> Otherwise we will end up spidering the whole web.
> We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have
> written a patch and will attach it.