[ https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294112#comment-16294112 ]

Hudson commented on NUTCH-2478:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3483 (See [https://builds.apache.org/job/Nutch-trunk/3483/])
NUTCH-2478 HTML parser should resolve base URL <base href=...> - fix (snagel: [https://github.com/apache/nutch/commit/607e7d950a2f3399db161b5a6770b40bf1d60c1a])
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
* (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
NUTCH-2478 HTML parser should resolve base URL <base href=...> - finally (snagel: [https://github.com/apache/nutch/commit/2aec79f13b04e022f0c30830a5e621cfcfffc88d])
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (add) src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestHtmlParser.java
* (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
* (edit) src/java/org/apache/nutch/util/DomUtil.java


> // is not a valid base URL
> --------------------------
>
>                 Key: NUTCH-2478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2478
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
>     
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/",
>         abs.toString());
>   }
> {code}
> and it has to fail, because "//www.example.org/" is not a valid base URL.
> In that case the current URL is used as the base instead. If the path of the
> current URL is not /, that path is prepended to the target, so a URL that
> returns 404 ends up being crawled.
> This ticket must allow // as a base and resolve the protocol.
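
The behaviour asked for above can be illustrated with plain java.net.URI. This is only a sketch of the idea, not the code from the commits listed in this message: the class name BaseHrefSketch, the method resolveWithBase and the page URL are hypothetical. It resolves the protocol-relative base against the URL of the current page first, so the base inherits that page's protocol, and only then resolves the link target against the result.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the Nutch sources.
class BaseHrefSketch {

  /**
   * Resolves target against a <base href> value that may be
   * protocol-relative ("//host/..."). The base is first resolved
   * against the URL of the page it appeared on.
   */
  static String resolveWithBase(String pageUrl, String baseHref, String target) {
    URI page = URI.create(pageUrl);
    // "//www.example.org/" carries no scheme, so it takes the page's scheme.
    URI base = page.resolve(baseHref);
    return base.resolve(target).toString();
  }

  public static void main(String[] args) {
    String page = "http://www.example.org/some/page.html";
    String target = "index/produkt/kanaly/";

    // With the base applied: http://www.example.org/index/produkt/kanaly/
    System.out.println(resolveWithBase(page, "//www.example.org/", target));

    // Base ignored, as described in the report: the page path is prepended,
    // giving http://www.example.org/some/index/produkt/kanaly/
    System.out.println(URI.create(page).resolve(target));
  }
}
```

The second call in main reproduces the failure described in the report: when the base is dropped, the directory of the current page ends up in front of the target.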



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
