I discovered the source of my problem.

The web server reports back a 302 with the header field spelled as:
location: url
but Nutch expects
Location: url

I fixed my problem by adding a few lines to Http.java.

Specifics and Patch:
--------------------

In package org.apache.nutch.protocol.http, class Http handles the 302 response as:

url = new URL(url,response.getHeader("Location"));

So it tries to match the exact string "Location" to get the redirect target, but the server I am dealing with reports back "location".

My patch is to change:

File: Http.java
Dist: 0.7.1 and 0.7.2
dir location from Nutch download: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
method: public ProtocolOutput getProtocolOutput(FetchListEntry fle)
below:
} else if (code >= 300 && code < 400) {   // handle redirect

Change:

url = new URL(url,response.getHeader("Location"));

to:

String loc = response.getHeader("Location");
if (loc == null) {
   // some servers send the header name in lowercase
   loc = response.getHeader("location");
}
url = new URL(url, loc);

This fixes the specific problem I have been having.
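
The two-spelling check above only covers "Location" and "location". HTTP header field names are case-insensitive (RFC 2616, section 4.2), so a server could also send "LOCATION". A more general variant would compare the names ignoring case. The following is only a minimal sketch, assuming the response headers are available as a plain Map; the class HeaderLookup, the helper getHeaderIgnoreCase, and the example.com URLs are made up for illustration and are not part of Nutch:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class HeaderLookup {

    // Hypothetical helper (not a Nutch API): returns the value of the first
    // header whose name matches 'name' ignoring case, or null if none matches.
    static String getHeaderIgnoreCase(Map<String, String> headers, String name) {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (e.getKey() != null && e.getKey().equalsIgnoreCase(name)) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) throws MalformedURLException {
        // Assumed sample data: a server that sends the header name in lowercase.
        Map<String, String> headers = new HashMap<>();
        headers.put("location", "/moved/here");

        URL url = new URL("http://example.com/start/");
        String loc = getHeaderIgnoreCase(headers, "Location");
        if (loc != null) {
            // Resolve the redirect target against the original URL.
            url = new URL(url, loc);
        }
        System.out.println(url);
    }
}
```

Checking loc for null before building the new URL also avoids passing a null spec to the URL constructor when the server sends no location header at all.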

HTH

Yuzo



--On Wednesday, June 14, 2006 6:04 PM -0700 Yuzo Kanomata <[EMAIL PROTECTED]> wrote:

Hi,

I'm using Nutch 0.7.2 and ran into this problem.

I was attempting to crawl some pages starting with a URL for which the
web server sends back a 302 redirect. Instead of reading the location
line and crawling there, Nutch does not seem to be able to retrieve the
content.

I checked some values in Fetcher.java.

Specifically, in the run method the protocol status is 16
(EXCEPTION) and the protocol content is null during the fetch cycle:

Protocol protocol = ProtocolFactory.getProtocol(url);
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
switch(pstat.getCode()) {

This seems odd in that I would expect a 302 to generate a
ProtocolStatus.MOVED value.

Am I making a mistake, such as not enabling a plugin?

I have the conf/nutch-default.xml plugin property set at:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

The url I am starting from is:

<http://proxy.arts.uci.edu/gamelab/portal/>

I did a telnet HTTP GET of the URL and also ran tcpdump on the
web server, and it is reporting back a 302 with a valid location field.

Thanks,

Yuzo



