Fetcher handling on 302 redirect

Yuzo Kanomata Wed, 14 Jun 2006 18:05:41 -0700

Hi,

I'm using Nutch 0.7.2 and ran into this problem.

I was attempting to crawl some pages starting from a URL for which the web server sends back a 302 redirect. Instead of reading the Location header and crawling there, Nutch does not seem able to retrieve the correct content.


I checked some values in Fetcher.java.

Specifically, in the run method, the protocol status is 16 (EXCEPTION) and the protocol content is null during the fetch cycle:


Protocol protocol = ProtocolFactory.getProtocol(url);
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
switch(pstat.getCode()) {

This seems odd, since I would expect a 302 to generate a ProtocolStatus.MOVED value.


Am I making a mistake, such as not enabling a plugin?

I have the conf/nutch-default.xml plugin property set at:

<property>
 <name>plugin.includes</name>

 <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>

The url I am starting from is:

<http://proxy.arts.uci.edu/gamelab/portal/>

I both issued an HTTP GET for the URL over telnet and ran tcpdump on the web server; it reports back a 302 with a valid Location field.
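
For anyone who wants to repeat that check from Java rather than telnet, here is a minimal sketch that requests a URL once without following redirects and prints the status code and Location header. It uses only the JDK's HttpURLConnection; the class name RedirectCheck and its helper methods are illustrative names of my own and are not part of Nutch, and the list of redirect codes is simply the common HTTP ones, not what the Nutch fetcher checks.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Nutch: shows what the server sends back
// for a single GET when redirects are NOT followed automatically.
public class RedirectCheck {

    /** True for the common HTTP status codes that redirect via a Location header. */
    public static boolean isRedirect(int code) {
        return code == 301 || code == 302 || code == 303 || code == 307 || code == 308;
    }

    /** Formats a response as "302 -> target" for redirects, or just the code otherwise. */
    public static String describe(int code, String location) {
        if (isRedirect(code) && location != null) {
            return code + " -> " + location;
        }
        return String.valueOf(code);
    }

    /** Requests the address once, without following redirects, and describes the reply. */
    public static String probe(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setInstanceFollowRedirects(false); // keep the 302 visible instead of chasing it
        conn.setRequestMethod("GET");
        try {
            return describe(conn.getResponseCode(), conn.getHeaderField("Location"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the start URL from this message; pass another URL as the first argument.
        String address = args.length > 0 ? args[0] : "http://proxy.arts.uci.edu/gamelab/portal/";
        System.out.println(probe(address));
    }
}
```

If the server behaves as the tcpdump suggests, this should print the 302 followed by the redirect target, which would confirm that the problem is on the Nutch side rather than in the response itself.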


Thanks,

Yuzo
