Bugs item #978614, was opened at 2004-06-24 01:26
Message generated for change (Comment added) made by aronsson
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=978614&group_id=59548

Category: fetcher
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Lars Aronsson (aronsson)
Assigned to: Nobody/Anonymous (nobody)
Summary: Redirect to local URL "MalformedURLException: no protocol"

Initial Comment:
This website redirects to a local (relative) URL
("/filename"), but the nutch crawler expects a protocol
at the beginning of the redirect URL (i.e.
"http://domain/filename"). From nutch's output:

040624 011812 fetching http://susning.nu/Kurt_Vonnegut/Slakthus_5
040624 011812 fetch of http://susning.nu/Kurt_Vonnegut/Slakthus_5 failed with:
java.net.MalformedURLException: no protocol: /susning.fcgi?action=browse&id=Bok/Slakthus_5&oldid=Kurt_Vonnegut/Slakthus_5

That's my website. You can try that URL for a test case.

This happened with nutch-0.4 and the command line
"nutch crawl urls -delay 3 -depth 3"

----------------------------------------------------------------------

>Comment By: Lars Aronsson (aronsson)
Date: 2004-06-24 21:29

Message:
Logged In: YES 
user_id=175880

Thinking about this again, the constructor solution is not
the right one. The response that redirects to the local URL
already has contents, so there is really no need to do a new
fetch. Instead, the received contents should be accepted but
cataloged under the URL given by the Location field.

Doing a new fetch from the new URL does no harm, though; it
is just unnecessary.

----------------------------------------------------------------------

Comment By: Lars Aronsson (aronsson)
Date: 2004-06-24 02:31

Message:
Logged In: YES 
user_id=175880

To clarify, the new line 118 should read:

        target = new URL(target, response.getHeader("Location"));


----------------------------------------------------------------------

Comment By: Lars Aronsson (aronsson)
Date: 2004-06-24 02:30

Message:
Logged In: YES 
user_id=175880

The best place to fix this bug seems to be the function
getResponse() in the file
src/java/net/nutch/net/protocols/http/Http.java, on line 118,
where the old url should be added as the first argument to
the java.net.URL constructor. This url is the context within
which the value of the Location field should be parsed.

See the constructor documentation at
http://stein.cshl.org/jade/distrib/docs/java.net.URL.html#URL(java.net.URL,%20java.lang.String)
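
As a rough illustration of that constructor (a standalone sketch, not
the actual Nutch code), the class below resolves a Location value
against the URL that was just fetched. The class name RedirectResolver
and the method resolve() are made up for this example; only
java.net.URL and its two-argument constructor come from the JDK. The
URLs are the ones from the initial comment.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only; nothing here is Nutch API.
class RedirectResolver {

    // Resolve the Location header value against the URL that was fetched.
    // The fetched URL is the context: a relative value such as "/filename"
    // inherits its protocol and host, an absolute value is used as it is.
    static String resolve(String fetchedUrl, String location) {
        try {
            URL context = new URL(fetchedUrl);
            return new URL(context, location).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("cannot resolve " + location, e);
        }
    }

    public static void main(String[] args) {
        String fetched = "http://susning.nu/Kurt_Vonnegut/Slakthus_5";
        String location = "/susning.fcgi?action=browse&id=Bok/Slakthus_5"
                + "&oldid=Kurt_Vonnegut/Slakthus_5";

        // The one-argument constructor has no context, so a relative
        // value is rejected with a MalformedURLException.
        try {
            new URL(location);
        } catch (MalformedURLException e) {
            System.out.println("without context: " + e.getMessage());
        }

        // With the fetched URL as context the redirect target resolves.
        System.out.println("with context: " + resolve(fetched, location));
    }
}
```

An absolute Location value passes through unchanged, so the same call
should cover both kinds of redirect.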


----------------------------------------------------------------------

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers