Hi Mathijs, I've posted a patch for this on https://issues.apache.org/jira/browse/NUTCH-731
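For readers following the thread, here is a rough sketch of what "following redirects for robots.txt" could look like. It is not the NUTCH-731 patch and not Nutch's RobotRulesParser code: the class `RobotsRedirectSketch`, the `Response` type, the `fetch` function and the hop limit are all hypothetical names chosen for illustration, and the fetcher is passed in so the logic can be tried without network access.

```java
import java.net.URI;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

public class RobotsRedirectSketch {

    /** Hypothetical minimal HTTP response: status code plus Location header (may be null). */
    public static final class Response {
        final int status;
        final String location;

        public Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    // Status codes treated as redirects in this sketch (an assumption, not Nutch's list).
    private static final Set<Integer> REDIRECTS = Set.of(301, 302, 303, 307, 308);

    /**
     * Derives the robots.txt URL for a page and follows at most maxRedirects hops.
     * Returns the URL that finally answered with a non-redirect status, or empty
     * if the hop limit was exceeded or a redirect carried no Location header.
     */
    public static Optional<URI> resolveRobotsUrl(URI page,
                                                 Function<URI, Response> fetch,
                                                 int maxRedirects) {
        // An absolute-path reference replaces the page path, keeping scheme and host.
        URI current = page.resolve("/robots.txt");
        for (int hop = 0; hop <= maxRedirects; hop++) {
            Response response = fetch.apply(current);
            if (!REDIRECTS.contains(response.status)) {
                return Optional.of(current);
            }
            if (response.location == null) {
                return Optional.empty();
            }
            // Location may be relative, so resolve it against the current URL.
            current = current.resolve(response.location);
        }
        return Optional.empty();
    }
}
```

With a fetcher that redirects http://7is7.com/robots.txt to http://www.7is7.com/robots.txt, the method returns the www URL. Whether the rules found there should then be applied to the original host is exactly the open question in the message quoted below; the sketch only shows the mechanics.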
HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/3/17 Mathijs Homminga <mathijs.hommi...@gmail.com>

> Hi everybody,
>
> Can someone shine a light on NUTCH-124:
> RobotRulesParser.java doesn't follow redirects when requesting the
> robots.txt file. Doug patched this, but that didn't make it to the trunk.
> What is the wished behavior here?
>
> For example, when requesting the following url:
> http://7is7.com/software/stateye/download/stateye097f.html
>
> ... RobotRulesParser requests the following robots.txt:
> http://7is7.com/robots.txt
>
> ... however, that file doesn't exist, it redirects to:
> http://www.7is7.com/robots.txt
>
> ... that robots.txt tells us the initial url is disallowed.
> But does it really? Or is robots.txt file only applicable to
> http://www.7is7.com and not http://7is7.com.
>
> So the question is: should we follow such redirects?
>
> Thanks,
> Mathijs