[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982271#comment-13982271
 ] 

Sebastian Nagel commented on NUTCH-797:
---------------------------------------

Ok, then I'll take over to patch 2.x and resolve this issue.

> parse-tika is not properly constructing URLs when the target begins with a "?"
> ------------------------------------------------------------------------------
>
>                 Key: NUTCH-797
>                 URL: https://issues.apache.org/jira/browse/NUTCH-797
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1, nutchgora
>         Environment: Win 7, Java(TM) SE Runtime Environment (build 
> 1.6.0_16-b01)
> Also reproduces on RHEL and Java 1.4.2
>            Reporter: Robert Hohman
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-797-2x.patch, NUTCH-797.patch, 
> pureQueryUrl-2.patch, pureQueryUrl.patch, test_nutch_797.html
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided 
> enough detail.
> In crawling the page at 
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
> links in the page that look like this:
> <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a 
> href="?co=0&sk=0&p=3&pi=1">3</a>
> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
> getOutlinks looks for links, it comes across this link and constructs a new 
> URL from a base URL object built from 
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a 
> target of "?co=0&sk=0&p=2&pi=1".
> The URL class merges these two to: 
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because it follows the older RFC 2396 rules, under which the rightmost 
> segment of the base path (the Search.aspx in this case) is stripped before 
> combining. RFC 3986 (see 
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge), in 
> section 5.2.2, instead keeps the base path when the reference consists only 
> of a query.
> While the result is compliant with RFC 2396, it means the URLs which are 
> created for the next round of fetching are incorrect.  Modern browsers handle 
> this case (I checked IE8 and Firefox 3.5), which matches the RFC 3986 
> behavior.
> I have fixed this by modifying DOMContentUtils to look for the case where a ? 
> begins the target, and then pulling the rightmost component out of the base 
> and inserting it into the target before the ?, so the target in this example 
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other HTML parsers in 
> Nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution?  Specifically I'm worried about 
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===================================================================
> --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java   (revision 916362)
> +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java   (working copy)
> @@ -299,6 +299,50 @@
>      return false;
>    }
>    
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +       // handle params that are embedded into the base url - move them to target
> +       // so URL class constructs the new url class properly
> +       if  (base.toString().indexOf(';') > 0)  
> +          return fixEmbeddedParams(base, target);
> +       
> +       // handle the case that there is a target that is a pure query.
> +       // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
> +       // URLs but I've seen this in numerous places, for example at
> +       // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +       // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
> +       // URL constructs the base+target combo as
> +       // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
> +       // dropping the Search.aspx target
> +       //
> +       // Browsers handle these just fine, they must have an exception similar to this
> +       if (target.startsWith("?"))
> +       {
> +               return fixPureQueryTargets(base, target);
> +       }
> +       
> +       return new URL(base, target);
> +  }
> +  
> +  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
> +  {
> +     if (!target.startsWith("?"))
> +             return new URL(base, target);
> +
> +     String basePath = base.getPath();
> +     String baseRightMost="";
> +     int baseRightMostIdx = basePath.lastIndexOf("/");
> +     if (baseRightMostIdx != -1)
> +     {
> +             baseRightMost = basePath.substring(baseRightMostIdx+1);
> +     }
> +     
> +     if (target.startsWith("?"))
> +             target = baseRightMost+target;
> +     
> +     return new URL(base, target);
> +  }
> +
>    /**
>     * Handles cases where the url param information is encoded into the base
>     * url as opposed to the target.
> @@ -400,8 +444,7 @@
>              if (target != null && !noFollow && !post)
>                try {
>                  
> -                URL url = (base.toString().indexOf(';') > 0) ? 
> -                  fixEmbeddedParams(base, target) :  new URL(base, target);
> +                URL url = fixURL(base, target);
>                  outlinks.add(new Outlink(url.toString(),
>                                           linkText.toString().trim()));
>                } catch (MalformedURLException e) {
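
For readers who want to try the pure-query handling from the quoted patch outside of Nutch, here is a minimal standalone sketch. The class name PureQueryResolver and the method name resolve are hypothetical and not part of the patch; the logic mirrors fixPureQueryTargets above under the assumption that only java.net.URL is available. It deliberately leaves out the fixEmbeddedParams branch for base URLs containing ';'.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only; not part of the Nutch code base.
public class PureQueryResolver {

    /**
     * Resolves target against base. If target is a pure query ("?..."),
     * the last path segment of base is prepended first, so that
     * java.net.URL does not drop it while merging.
     */
    public static URL resolve(URL base, String target) throws MalformedURLException {
        if (target.startsWith("?")) {
            String path = base.getPath();
            // Everything after the last '/' (the whole path if there is no '/').
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
        // Prints the URL with Search.aspx preserved in front of the new query.
        System.out.println(resolve(base, "?co=0&sk=0&p=2&pi=1"));
    }
}
```

One caveat with this approach: if the last path segment of the base contains a ':' character, the rewritten target may be mistaken for an absolute URL with a scheme, so real code would probably need an extra check for that case.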



--
This message was sent by Atlassian JIRA
(v6.2#6252)
