[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982276#comment-13982276 ]
Sebastian Nagel commented on NUTCH-566:
---------------------------------------
Linked to duplicate issues NUTCH-797 and NUTCH-952.
> Sun's URL class has bug in creation of relative query URLs
> ----------------------------------------------------------
>
> Key: NUTCH-566
> URL: https://issues.apache.org/jira/browse/NUTCH-566
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8, 0.8.1, 0.9.0
> Environment: MacOS X and Linux (CentOS 4.5) both
> Reporter: Doug Cook
> Priority: Minor
> Attachments: RelativeURL.java
>
>
> I'm using 0.8.1, but this will affect all other versions as well.
> Relative links of the form "?blah" are resolved incorrectly. For example,
> with a base URL of http://www.fleurie.org/entreprise.asp, and a relative link
> of "?id_entrep=111", Nutch will resolve this pair to the link
> "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers
> I tried resolve the pair to
> "http://www.fleurie.org/entreprise.asp?id_entrep=111".
> I tracked this down to what could be called a bug in Sun's URL class.
> According to Sun's spec, they parse the relative URL according to RFC 2396.
> But the original RFC for relative links was RFC 1808, and the two RFCs differ
> in how they handle relative links beginning with "?". Most browsers
> (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for
> compatibility and also because the behavior makes more sense). Apparently
> even the people that wrote RFC 2396 recognized that this was a mistake, and
> the specified behavior was changed in RFC 3986 to match what browsers do.
> For a discussion of this, see
> http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
> Sun's URL implementation, however, still implements RFC 2396, as far as I can
> tell, and is out of step with the rest of the world.
> This breaks link extraction on a number of sites.
> I implemented a simple workaround, which I'm attaching. It is a static method
> that creates URLs and otherwise behaves exactly like new URL(URL base, String
> relativePath), and I use it as a drop-in replacement for that constructor in
> DOMContentUtils, JavaScript link extraction, etc. It only matters where
> links are extracted. I haven't included the calling code from
> DOMContentUtils, etc. because my local versions are largely rewritten, but
> the change should be straightforward.
> I put it in the org.apache.nutch.net directory, but feel free to move it to
> another place if you feel it belongs there!
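
For readers without access to the attachment, the workaround described in the quoted report could look roughly like the sketch below. This is not the attached RelativeURL.java: the class name RelativeUrlSketch and the methods resolve and resolveToString are hypothetical names chosen for illustration. The approach assumed here is to prepend the last path segment of the base URL to a target that starts with "?", so that java.net.URL receives an ordinary relative path and the result no longer depends on which RFC it follows for query-only references.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Illustrative sketch only; hypothetical names, not the attached RelativeURL.java. */
@SuppressWarnings("deprecation") // the URL constructors are deprecated on recent JDKs
final class RelativeUrlSketch {

    private RelativeUrlSketch() {}

    /**
     * Resolves target against base like new URL(base, target), except that a
     * query-only target ("?...") keeps the last path segment of the base.
     */
    static URL resolve(URL base, String target) throws MalformedURLException {
        String trimmed = target.trim();
        if (!trimmed.startsWith("?")) {
            // Everything else is left to the standard constructor.
            return new URL(base, trimmed);
        }
        // "/entreprise.asp" -> "entreprise.asp"; "/dir/" or "" -> "".
        String path = base.getPath();
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        // Re-attach the segment so the target is a plain relative path plus query.
        return new URL(base, lastSegment + trimmed);
    }

    /** Convenience wrapper for quick experiments with string arguments. */
    static String resolveToString(String base, String target) {
        try {
            return resolve(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With the example from the report, RelativeUrlSketch.resolveToString("http://www.fleurie.org/entreprise.asp", "?id_entrep=111") yields "http://www.fleurie.org/entreprise.asp?id_entrep=111". Edge cases such as a last path segment containing ":" are not handled by this sketch.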