[
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288280#comment-16288280
]
Sebastian Nagel commented on NUTCH-2478:
----------------------------------------
Confirmed:
{noformat}
$ cat /var/www/html/nutch/test_nutch_2478.html
<html>
<head>
<title>Test NUTCH-2478</title>
<base href="//www.example.com/">
</head>
<body>
<a href="index/produkt/kanaly/">Kanaly</a>
</body>
</html>
$ $NUTCH_HOME/bin/nutch parsechecker http://localhost/nutch/test_nutch_2478.html
...
Outlinks: 1
outlink: toUrl: http://localhost/nutch/index/produkt/kanaly/ anchor: Kanaly
...
{noformat}
But shouldn't the protocol depend on the originally fetched URL? In other
words: whether the base URL and the outlink start with http:// or https://
depends on the URL fetched.
The HTML standard, or at least common sense, is to resolve the base URL
against the URL of the document
([1|https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base],
[2|https://html.spec.whatwg.org/multipage/semantics.html#frozen-base-url]).
What about properly resolving the base URL in parse-html and parse-tika?
Actually, the method getBase(Node root) in DOMContentUtils isn't resolving the
URL found in {{<base href="..."/>}}.
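To illustrate the idea, here is a minimal sketch of resolving the raw {{<base href>}} value against the fetched URL before using it to resolve outlinks. It is not the actual Nutch code or the eventual fix: the class BaseHrefSketch and the methods resolveBase and resolveLink are hypothetical names, and it assumes plain java.net.URL resolution is acceptable here.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch, not part of Nutch: shows that resolving the raw
// <base href> value against the fetched URL yields a usable base URL.
final class BaseHrefSketch {

  /** Resolve the raw base href against the URL the document was fetched from. */
  static URL resolveBase(URL fetched, String baseHref) {
    if (baseHref == null || baseHref.trim().isEmpty()) {
      return fetched; // no <base> element: the document URL is the base
    }
    try {
      // A protocol-relative value such as "//www.example.com/" inherits
      // the scheme (http or https) of the fetched URL.
      return new URL(fetched, baseHref.trim());
    } catch (MalformedURLException e) {
      return fetched; // unusable base href: fall back to the document URL
    }
  }

  /** Resolve a link found in the document, honoring the (resolved) base href. */
  static String resolveLink(String fetchedUrl, String baseHref, String link) {
    try {
      URL base = resolveBase(new URL(fetchedUrl), baseHref);
      return new URL(base, link).toString();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    String fetched = "http://localhost/nutch/test_nutch_2478.html";
    // With the base href resolved first, the outlink points to www.example.com.
    System.out.println(
        resolveLink(fetched, "//www.example.com/", "index/produkt/kanaly/"));
    // Ignoring the base href reproduces the outlink reported by parsechecker.
    System.out.println(resolveLink(fetched, null, "index/produkt/kanaly/"));
  }
}
```

With the base resolved this way, the outlink for the test page above would be http://www.example.com/index/produkt/kanaly/, and fetching the same page via https:// would give an https:// outlink.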
> // is not a valid base URL
> --------------------------
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
> @Test
> public void testBadResolver() throws Exception {
>   URL base = new URL("//www.example.org/");
>   String target = "index/produkt/kanaly/";
>
>   URL abs = URLUtil.resolveURL(base, target);
>   Assert.assertEquals("http://www.example.org/index/produkt/kanaly/",
>       abs.toString());
> }
> {code}
> and has to fail because the base URL is invalid, so the current URL is used
> instead. If the path of the current URL is not /, that path is prepended,
> resulting in 404s being crawled.
> This ticket must allow // as base, and resolve the protocol.