Hi everybody,
Can someone shed some light on NUTCH-124?
RobotRulesParser.java doesn't follow redirects when requesting the
robots.txt file. Doug patched this, but the patch didn't make it into
trunk.
What is the desired behavior here?
For example, when requesting the following url:
http://7is7.com/software/stateye/download/stateye097f.html
... RobotRulesParser requests the following robots.txt:
http://7is7.com/robots.txt
... however, that file doesn't exist; the request redirects to:
http://www.7is7.com/robots.txt
... that robots.txt tells us the initial url is disallowed.
But does it really? Or is that robots.txt file only applicable to http://www.7is7.com
and not to http://7is7.com?
So the question is: should we follow such redirects?
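To make the case concrete, here is a minimal Java sketch. It is not the actual RobotRulesParser code; the class RobotsRedirectSketch and the methods robotsUrlFor and sameHost are made up for illustration. It only builds the robots.txt URL for a page and checks whether a redirect target stays on the same host, without doing any network I/O:

```java
import java.net.URI;

public class RobotsRedirectSketch {

    // Hypothetical helper: robots.txt URL for the host that serves the page.
    // Resolving an absolute path keeps the scheme, host and port of the page.
    static URI robotsUrlFor(URI page) {
        return page.resolve("/robots.txt");
    }

    // Hypothetical helper: true only if the redirect target has exactly the
    // same host name as the requested URL (compared case-insensitively).
    static boolean sameHost(URI requested, URI redirectedTo) {
        String host = requested.getHost();
        return host != null && host.equalsIgnoreCase(redirectedTo.getHost());
    }

    public static void main(String[] args) {
        URI page = URI.create(
            "http://7is7.com/software/stateye/download/stateye097f.html");
        URI robots = robotsUrlFor(page);
        URI target = URI.create("http://www.7is7.com/robots.txt");

        System.out.println(robots);                   // robots.txt we request
        System.out.println(sameHost(robots, target)); // does the redirect stay on host?
    }
}
```

For the example above, the redirect goes from 7is7.com to www.7is7.com, so sameHost returns false. Whether the rules found behind such a cross-host redirect should still apply to the original host is exactly what I'm asking.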
Thanks,
Mathijs