[ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288280#comment-16288280
 ] 

Sebastian Nagel commented on NUTCH-2478:
----------------------------------------

Confirmed:
{noformat}
$ cat /var/www/html/nutch/test_nutch_2478.html
<html>
<head>
  <title>Test NUTCH-2478</title>
  <base href="//www.example.com/">
</head>
<body>
  <a href="index/produkt/kanaly/">Kanaly</a>
</body>
</html>

$ $NUTCH_HOME/bin/nutch parsechecker http://localhost/nutch/test_nutch_2478.html
...
Outlinks: 1
  outlink: toUrl: http://localhost/nutch/index/produkt/kanaly/ anchor: Kanaly
...
{noformat}

But shouldn't the protocol depend on the originally fetched URL? In other 
words: whether the base URL and the outlink start with http:// or https:// 
depends on the URL fetched.

The HTML standard, or at least common sense, says to resolve the base URL 
against the URL of the document 
([1|https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base], 
[2|https://html.spec.whatwg.org/multipage/semantics.html#frozen-base-url]). 
What about properly resolving the base URL in parse-html and parse-tika? 
Actually, the method getBase(Node root) in DOMContentUtils isn't resolving the 
URL found in {{<base href="..."/>}}.
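
To illustrate the suggestion, here is a minimal sketch of resolving the base href against the fetched URL first and only then resolving the outlink. It uses plain java.net.URI rather than Nutch's URLUtil or DOMContentUtils; the class BaseUrlSketch and its method names are hypothetical and not part of Nutch.

```java
import java.net.URI;

// Hypothetical sketch, not Nutch code: shows the two-step resolution only.
class BaseUrlSketch {

    /** Resolves the value of <base href="..."> against the URL the page was fetched from. */
    static URI resolveBase(String fetchedUrl, String baseHref) {
        // A protocol-relative href such as "//www.example.com/" inherits
        // the scheme (http or https) of the fetched URL.
        return URI.create(fetchedUrl).resolve(baseHref);
    }

    /** Resolves an outlink target against the already resolved base URL. */
    static String resolveOutlink(String fetchedUrl, String baseHref, String target) {
        return resolveBase(fetchedUrl, baseHref).resolve(target).toString();
    }

    public static void main(String[] args) {
        String baseHref = "//www.example.com/";
        String target = "index/produkt/kanaly/";

        // prints http://www.example.com/index/produkt/kanaly/
        System.out.println(resolveOutlink(
            "http://localhost/nutch/test_nutch_2478.html", baseHref, target));

        // prints https://www.example.com/index/produkt/kanaly/
        System.out.println(resolveOutlink(
            "https://localhost/nutch/test_nutch_2478.html", baseHref, target));
    }
}
```

With this order of resolution the outlink no longer picks up the /nutch/ path of the fetched page, and the protocol follows the fetched URL.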

> // is not a valid base URL
> --------------------------
>
>                 Key: NUTCH-2478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2478
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
>     
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/", 
>         abs.toString());
>   }
> {code}
> and has to fail because the base URL is invalid, so the current URL is used 
> instead. If the path of the current URL is not /, that path is prepended, 
> resulting in URLs that return 404 being crawled.
> This ticket must allow // as a base and resolve the protocol.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
