[
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288289#comment-16288289
]
Markus Jelsma commented on NUTCH-2478:
--------------------------------------
Yes, this needs a change in the parser plugins. I tried to fix it in Content's
constructor, but that does not work because the base is not passed there. It
needed a change to DomContentUtils so that the protocol is also passed to the
method from the ParserPlugin. In getBase() or thereabouts I then resolve the
protocol if the base URL starts with //.
We already solved it for a customer in our custom ParsePlugin; I will post a
patch for ParseTika and ParseHtml later.
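
As a rough illustration of the approach described above (not the actual patch), the following minimal sketch borrows the protocol of the page being parsed when the base href starts with //. The class name SchemeRelativeBase and the method resolveBase are hypothetical and not part of Nutch; java.net.URI is used here instead of URL only to keep the example free of checked exceptions.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: illustrates resolving a
// scheme-relative base href ("//host/path") against the page's protocol.
final class SchemeRelativeBase {

  /**
   * Returns an absolute base URI. If baseHref starts with "//", the scheme
   * of pageUrl is prepended; otherwise baseHref is resolved against pageUrl.
   */
  static URI resolveBase(String pageUrl, String baseHref) {
    URI page = URI.create(pageUrl);
    if (baseHref.startsWith("//")) {
      // Assumption: the protocol of the fetched page is the right one to reuse.
      return URI.create(page.getScheme() + ":" + baseHref);
    }
    return page.resolve(baseHref);
  }

  public static void main(String[] args) {
    URI base = resolveBase("https://www.example.org/some/page.html",
        "//www.example.org/");
    // Relative links are then resolved against the repaired base,
    // not against the path of the current page.
    URI abs = base.resolve("index/produkt/kanaly/");
    System.out.println(abs); // https://www.example.org/index/produkt/kanaly/
  }
}
```

With the base repaired this way, the relative target no longer picks up the path of the current page, which is what led to the 404s described in the issue.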
> // is not a valid base URL
> --------------------------
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
> @Test
> public void testBadResolver() throws Exception {
>   URL base = new URL("//www.example.org/");
>   String target = "index/produkt/kanaly/";
>
>   URL abs = URLUtil.resolveURL(base, target);
>   Assert.assertEquals("http://www.example.org/index/produkt/kanaly/",
>       abs.toString());
> }
> {code}
> and it has to fail because the base URL is invalid, so the current URL is
> used instead. If the path of the current URL is not /, that path is
> prepended, resulting in a 404 being crawled.
> This ticket must allow // as a base and resolve the protocol.