[jira] Commented: (NUTCH-926) Nutch follows wrong url in <META http-equiv="refresh" tag

Andrzej Bialecki (JIRA) Wed, 27 Oct 2010 13:45:46 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543 ]


Andrzej Bialecki  commented on NUTCH-926:
-----------------------------------------

bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!!
No need to shout, we hear you :)

Indeed, Nutch's behavior when following redirects doesn't play well with the 
rule of ignoring external outlinks. Strictly speaking, redirects are not 
outlinks, but the implicit assumption behind ignoreExternalOutlinks is that we 
crawl content only from the same hostname.

Your patch would solve this particular issue. However, it is not as simple as 
it seems. My favorite example is www.ibm.com -> www8.ibm.com/index.html . If 
we apply your fix, you won't be able to crawl www.ibm.com unless you inject 
all the wwwNNN load-balanced hosts, so a simple equality of hostnames may not 
be sufficient. We have utilities to extract domain names, so we could compare 
domains instead, but then we may wrongly treat money.cnn.com and 
weather.cnn.com as the same site ...
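
To make the trade-off concrete, here is a minimal sketch of the two policies. It is not Nutch code and not the attached patch: the class RedirectScope and its methods are hypothetical names, and naiveDomain simply keeps the last two labels of a hostname, which is an assumption that breaks for suffixes such as co.uk.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Nutch.
public class RedirectScope {

    // Lower-cased host of a URL, or "" if the URL has no host part.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase(Locale.ROOT);
    }

    // Strict policy: follow a redirect only if the host is exactly the same.
    static boolean sameHost(String from, String to) {
        return host(from).equals(host(to));
    }

    // Naive "domain": the last two labels of the host.
    // Assumption: ignores public suffixes such as co.uk.
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Loose policy: follow a redirect if both hosts share the naive domain.
    static boolean sameDomain(String from, String to) {
        return naiveDomain(host(from)).equals(naiveDomain(host(to)));
    }

    public static void main(String[] args) {
        String ibm = "http://www.ibm.com/";
        String ibm8 = "http://www8.ibm.com/index.html";
        String money = "http://money.cnn.com/";
        String weather = "http://weather.cnn.com/";

        // Strict host equality blocks the load-balancer redirect.
        System.out.println(sameHost(ibm, ibm8));        // false
        // Domain comparison allows it ...
        System.out.println(sameDomain(ibm, ibm8));      // true
        // ... but also merges two sites that are arguably distinct.
        System.out.println(sameDomain(money, weather)); // true
    }
}
```

Neither check is right for every crawl, which is why the choice probably needs to be configurable rather than hard-coded.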

> Nutch follows wrong url in <META http-equiv="refresh" tag
> ---------------------------------------------------------
>
>                 Key: NUTCH-926
>                 URL: https://issues.apache.org/jira/browse/NUTCH-926
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: gnu/linux centOs
>            Reporter: Marco Novo
>            Priority: Critical
>             Fix For: 1.3
>
>         Attachments: ParseOutputFormat.java.patch
>
>
> We have Nutch set to crawl a list of domain URLs, and we want to fetch only 
> the listed domains (hosts), not their subdomains.
> For example:
> WWW.DOMAIN1.COM
> ..
> ..
> ..
> WWW.RIGHTDOMAIN.COM
> ..
> ..
> ..
> ..
> WWW.DOMAIN.COM
> We set Nutch to:
> NOT FOLLOW EXTERNAL LINKS
> While crawling WWW.RIGHTDOMAIN.COM,
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
>     <META http-equiv="refresh" content="0;
>     url=http://WRONG.RIGHTDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG subdomains! But it should not do this!!
> While crawling WWW.RIGHTDOMAIN.COM,
> if a page contains
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
> <title></title>
>     <META http-equiv="refresh" content="0;
>     url=http://WWW.WRONGDOMAIN.COM">
> </head>
> <body>
> </body>
> </html>
> Nutch continues to crawl the WRONG domain! But it should not do this! If it 
> does, we will end up spidering the whole web.
> We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have 
> made a patch and will attach it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
