Hi,

I'm using Nutch 0.7.2 and ran into this problem.

I was trying to crawl some pages starting from a URL for which the web server sends back a 302 redirect. Instead of reading the Location header and crawling the target, Nutch does not seem able to retrieve the correct content.

I checked some values in Fetcher.java.

Specifically, in the run method, the protocol status is 16 (EXCEPTION) and the protocol content is null during the fetch cycle:

Protocol protocol = ProtocolFactory.getProtocol(url);
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
switch(pstat.getCode()) {

This seems odd in that I would expect a 302 to generate a ProtocolStatus.MOVED value.

Am I making a mistake, such as not enabling a plugin?

I have the plugin.includes property in conf/nutch-default.xml set to:

<property>
 <name>plugin.includes</name>

 <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>

The url I am starting from is:

<http://proxy.arts.uci.edu/gamelab/portal/>

I did both a telnet HTTP GET of the URL and a tcpdump on the web server, and both show a 302 response with a valid Location header.
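
For anyone who wants to reproduce that check without telnet, it can be done from plain Java. This is only a minimal sketch using the JDK's HttpURLConnection with redirect following switched off; the class name RedirectCheck and the method probe are made up for illustration and are not part of Nutch.

```java
import java.net.HttpURLConnection;
import java.net.URI;

public class RedirectCheck {

    // Hypothetical helper: issues a GET without following redirects and
    // returns the status code and Location header as "<code> <location>".
    static String probe(String url) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        // Keep the raw 302 visible instead of letting the JDK follow it.
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("GET");
        try {
            int code = conn.getResponseCode();
            // null if the server sent no Location header.
            String location = conn.getHeaderField("Location");
            return code + " " + location;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java RedirectCheck <url>");
            return;
        }
        System.out.println(probe(args[0]));
    }
}
```

Running it against the start URL should print the status code followed by the redirect target, which is the same information the telnet session shows.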

Thanks,

Yuzo
