[ http://issues.apache.org/jira/browse/NUTCH-124?page=comments#action_12357186 ]

Fuad Efendi commented on NUTCH-124:
-----------------------------------

Is such behavior defined in the Robots Exclusion Protocol 
(http://www.robotstxt.org/)? If so, it should be some kind of new field in 
the robots.txt of the source site, such as:
Redirect-Disallow: Nutch

Compare this with Nutch's behavior when one site links to a page on a second 
site, and the second site has a "Disallow" rule for that page. Nutch handles 
this correctly: it uses the robots.txt file from the same site as the web page.

Robots.txt MUST NOT define behavior for foreign sites.
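
To make the point concrete, here is a minimal sketch in Python (not Nutch's actual Java code) of a check that consults only the robots.txt of the host that serves the page. The host names, the ROBOTS_BY_HOST table and the is_allowed helper are hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, keyed by host (illustrative data only).
ROBOTS_BY_HOST = {
    "site-a.example": "User-agent: *\nDisallow:\n",
    "site-b.example": "User-agent: *\nDisallow: /private/\n",
}


def is_allowed(url, agent="Nutch"):
    """Check a URL against the robots.txt of the URL's own host only."""
    host = urlsplit(url).netloc
    parser = RobotFileParser()
    # Assumption: a host with no known robots.txt is treated as unrestricted.
    parser.parse(ROBOTS_BY_HOST.get(host, "").splitlines())
    return parser.can_fetch(agent, url)


# A link from site-a to a disallowed page on site-b is judged by site-b's rules.
print(is_allowed("http://site-b.example/private/page.html"))
print(is_allowed("http://site-b.example/public.html"))
```

In this sketch the rules of site-a never influence the decision for a page on site-b, whichever site the link came from.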


> protocol-httpclient does not follow redirects when fetching robots.txt
> ----------------------------------------------------------------------
>
>          Key: NUTCH-124
>          URL: http://issues.apache.org/jira/browse/NUTCH-124
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.8-dev, 0.7.2-dev
>     Reporter: Doug Cutting
>      Fix For: 0.8-dev

>
> If a site's robots.txt redirects, protocol-httpclient does not correctly 
> fetch the robots.txt and effectively ignores it for the site.  See 
> http://www.webmasterworld.com/forum11/3008.htm.
