I discovered the source of my problem.
The web server reports back 302 with the header field as:
location: url
but Nutch expects
Location: url
I fixed my problem by adding a few lines to Http.java.
Specifics and Patch:
--------------------
In package org.apache.nutch.protocol.http, class Http handles the 302
response as:
url = new URL(url,response.getHeader("Location"));
so it tries to match the exact string "Location" to get the redirect target,
but the server I am dealing with sends "location". HTTP header field names
are case-insensitive, so the server's response is valid.
My patch is to change:
File: Http.java
Dist: 0.7.1 and 0.7.2
Directory within the Nutch download:
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
method: public ProtocolOutput getProtocolOutput(FetchListEntry fle)
below:
} else if (code >= 300 && code < 400) { // handle redirect
Change:
url = new URL(url,response.getHeader("Location"));
to:
String loc = response.getHeader("Location");
if (loc == null) {
  // some servers send the header name in lowercase
  loc = response.getHeader("location");
}
url = new URL(url, loc);
This fixes the specific problem I have been having.
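Note that the change above only covers the two spellings "Location" and "location"; a server sending "LOCATION" would still be missed. A more general approach is to make the header lookup itself ignore case. The following is only a rough sketch of that idea, not the actual Nutch code: the class RedirectHeaders and its methods are hypothetical names, it assumes the response headers are available as a plain Map, and it uses newer Java syntax (generics) than the 0.7.x sources.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: shows a case-insensitive header lookup.
class RedirectHeaders {

    /** Copies the headers into a map whose key comparison ignores case. */
    static Map<String, String> caseInsensitive(Map<String, String> rawHeaders) {
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.putAll(rawHeaders);
        return headers;
    }

    /**
     * Returns the redirect target resolved against the fetched URL,
     * or null if the response carries no Location header in any spelling.
     */
    static String redirectTarget(String baseUrl, Map<String, String> rawHeaders) {
        String loc = caseInsensitive(rawHeaders).get("Location");
        if (loc == null) {
            return null;
        }
        try {
            // Same two-argument URL constructor the original Http.java line uses,
            // so relative Location values are resolved against the base URL.
            return new URL(new URL(baseUrl), loc).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad redirect URL: " + loc, e);
        }
    }
}
```

With this, "location", "Location" and "LOCATION" all resolve to the same entry, so the redirect branch would not need a separate check per spelling.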
HTH
Yuzo
--On Wednesday, June 14, 2006 6:04 PM -0700 Yuzo Kanomata
<[EMAIL PROTECTED]> wrote:
Hi,
I'm using Nutch 0.7.2 and ran into this problem.
I was attempting to crawl some pages starting from a URL for which the
web server sends back a 302 redirect. Instead of reading the Location
header and crawling the target, Nutch does not seem to be able to get the
correct content.
I checked some values in Fetcher.java.
Specifically, in the run method, the protocol status is 16
(EXCEPTION) and the protocol content is null during the fetch cycle:
Protocol protocol = ProtocolFactory.getProtocol(url);
ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
switch(pstat.getCode()) {
This seems odd in that I would expect a 302 to generate a
ProtocolStatus.MOVED value.
Am I making a mistake, such as not enabling a plugin?
I have the plugin property in conf/nutch-default.xml set to:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
The url I am starting from is:
<http://proxy.arts.uci.edu/gamelab/portal/>
I both did an HTTP GET of the URL via telnet and ran tcpdump on the
web server, and it is reporting back a 302 with a valid location field.
Thanks,
Yuzo