Re: robots.txt redirect (NUTCH-124)

Julien Nioche Fri, 03 Apr 2009 11:06:06 -0700

Hi Mathijs,

I've posted a patch for this on
https://issues.apache.org/jira/browse/NUTCH-731


HTH

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/3/17 Mathijs Homminga <mathijs.hommi...@gmail.com>

> Hi everybody,
>
> Can someone shed some light on NUTCH-124?
> RobotRulesParser.java doesn't follow redirects when requesting the
> robots.txt file. Doug patched this, but the patch didn't make it into trunk.
> What is the desired behavior here?
>
>
> For example, when requesting the following url:
> http://7is7.com/software/stateye/download/stateye097f.html
>
> ... RobotRulesParser requests the following robots.txt:
> http://7is7.com/robots.txt
>
> ... however, that file doesn't exist; the request redirects to:
> http://www.7is7.com/robots.txt
>
> ... and that robots.txt tells us the initial url is disallowed.
> But does it really? Or does that robots.txt file apply only to
> http://www.7is7.com and not to http://7is7.com?
>
> So the question is: should we follow such redirects?
>
> Thanks,
> Mathijs
>
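
To make the question above concrete, here is a minimal sketch in Java of one possible policy. It is not the NUTCH-124 or NUTCH-731 patch and not Nutch's actual code. The class name RobotsRedirectSketch and the helpers robotsUrl, sameSiteRedirect and stripWww are hypothetical names used only for illustration. The assumption is that a redirected robots.txt is honoured only when the target differs from the requested host by a leading "www." and still points at /robots.txt.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Nutch.
final class RobotsRedirectSketch {

    // Build the robots.txt URL for the host that serves the given page.
    static URI robotsUrl(URI page) {
        return URI.create(page.getScheme() + "://" + page.getAuthority() + "/robots.txt");
    }

    // Assumed policy: accept the redirect only if both hosts are equal
    // once a leading "www." is ignored, and the target is still /robots.txt.
    static boolean sameSiteRedirect(URI requested, URI redirectTarget) {
        String a = stripWww(requested.getHost());
        String b = stripWww(redirectTarget.getHost());
        return a.equals(b) && "/robots.txt".equals(redirectTarget.getPath());
    }

    // Lower-case the host and drop a leading "www." if present.
    static String stripWww(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.startsWith("www.") ? h.substring(4) : h;
    }

    public static void main(String[] args) {
        URI page = URI.create("http://7is7.com/software/stateye/download/stateye097f.html");
        URI robots = robotsUrl(page);
        URI target = URI.create("http://www.7is7.com/robots.txt");
        System.out.println(robots);                           // http://7is7.com/robots.txt
        System.out.println(sameSiteRedirect(robots, target)); // true
    }
}
```

Under this assumed policy the rules at http://www.7is7.com/robots.txt would be applied to http://7is7.com as well, while a redirect to an unrelated host would be ignored. Whether that is the right behavior is exactly the open question in the message above.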
