Hi,
I'm using Nutch 0.7.2 and ran into this problem.
I was attempting to crawl some pages, starting from a url for which the
webserver sends back a 302 redirect. Instead of reading the Location header
and crawling there, Nutch does not seem to retrieve any content.
I checked some values in Fetcher.java.
Specifically, in the run method the protocol status is 16 (EXCEPTION) and
the protocol content is null during the fetch cycle:
    Protocol protocol = ProtocolFactory.getProtocol(url);
    ProtocolOutput output = protocol.getProtocolOutput(fle);
    ProtocolStatus pstat = output.getStatus();
    switch(pstat.getCode()) {
This seems odd, since I would expect a 302 to generate a
ProtocolStatus.MOVED value.
Am I making a mistake, such as not enabling a plugin?
I have the plugin.includes property in conf/nutch-default.xml set to:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
The url I am starting from is:
<http://proxy.arts.uci.edu/gamelab/portal/>
I did a manual HTTP GET of the url over telnet and also ran tcpdump on the
webserver. Both show a 302 response with a valid Location header.
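For anyone who wants to repeat that check from Java rather than telnet, here
is a minimal sketch. It is independent of Nutch and uses only the JDK's
HttpURLConnection with redirect following switched off. The class name
RedirectCheck and the probe helper are made up for illustration; they are not
part of Nutch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class RedirectCheck {

    /** Status code and Location header of a single, non-followed request. */
    public static final class Result {
        public final int status;
        public final String location; // null if the server sent no Location header

        Result(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** Sends one GET and reports the raw response without following redirects. */
    public static Result probe(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // we want to see the 302 itself
        conn.setRequestMethod("GET");
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            return new Result(status, location);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0
                ? args[0]
                : "http://proxy.arts.uci.edu/gamelab/portal/";
        Result r = probe(url);
        System.out.println(r.status + " Location: " + r.location);
    }
}
```

If this prints 302 together with a Location value, the server side is behaving
as expected and the problem is more likely in how the fetcher handles the
response.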
Thanks,
Yuzo