parse-tika is not properly constructing URLs when the target begins with a "?"
------------------------------------------------------------------------------

                 Key: NUTCH-797
                 URL: https://issues.apache.org/jira/browse/NUTCH-797
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.1
         Environment: Win 7, Java(TM) SE Runtime Environment (build 1.6.0_16-b01)
                      Also reproduces on RHEL and Java 1.4.2
            Reporter: Robert Hohman
            Priority: Minor
         Attachments: pureQueryUrl.patch

This is my first bug report and patch for Nutch, so apologies if I have not 
provided enough detail.

In crawling the page at 
http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
links in the page that look like this:

<a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>

In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
getOutlinks looks for links, it comes across this link and constructs a new URL 
from a base URL object built from 
"http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a 
target of "?co=0&sk=0&p=2&pi=1".

The URL class defines how to merge these two. It follows the older RFC 2396 
rules (the ones cited in the patch comment below) and merges them to: 
http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1

because under those rules the rightmost path segment of the base (the 
Search.aspx in this case) is removed before combining. RFC 3986 (see section 
5.2.2; the merge step is at 
http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge) changed 
this: when the reference has an empty path, as a pure query does, the base 
path is kept unchanged.

While the URL class behavior matches RFC 2396, it means the URLs which are 
created for the next round of fetching are incorrect.  Modern browsers resolve 
this case to the Search.aspx URL (I checked IE8 and Firefox 3.5), which is 
consistent with RFC 3986, so this looks like a limitation of the URL class 
rather than a poorly formed URL on Accenture's part.
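
The merge described above can be reproduced outside Nutch with a few lines of 
Java. This is a minimal sketch: the class name PureQueryMergeDemo and the merge 
helper are made up for illustration and are not part of Nutch. It only hands the 
base and the target to the java.net.URL(URL, String) constructor, as getOutlinks 
does.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
final class PureQueryMergeDemo {

    // Resolves target against base with the java.net.URL(URL, String)
    // constructor, the same call the outlink extraction relies on.
    @SuppressWarnings("deprecation") // URL constructors are deprecated on recent JDKs
    static String merge(String base, String target) {
        try {
            return new URL(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0";
        // Prints the merged URL; the Search.aspx segment is missing from it.
        System.out.println(merge(base, "?co=0&sk=0&p=2&pi=1"));
    }
}
```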

I have fixed this by modifying DOMContentUtils to look for the case where a ? 
begins the target, and then pulling the rightmost component out of the base and 
inserting it into the target before the ?, so the target in this example 
becomes:
Search.aspx?co=0&sk=0&p=2&pi=1

The URL class then properly constructs the new url as:
http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
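
The workaround described above can also be sketched in isolation. This is only 
an illustration under assumptions: the class PureQueryFixSketch and the method 
resolve are hypothetical names, and the attached patch remains the actual 
proposal. The sketch copies the rightmost path component of the base in front 
of a target that starts with "?" and then lets the URL class do the merge.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch class; the proposed change itself lives in DOMContentUtils.
final class PureQueryFixSketch {

    @SuppressWarnings("deprecation") // URL constructors are deprecated on recent JDKs
    static String resolve(String baseSpec, String target) {
        try {
            URL base = new URL(baseSpec);
            if (target.startsWith("?")) {
                String path = base.getPath();
                int idx = path.lastIndexOf('/');
                // Rightmost path component, e.g. "Search.aspx"; empty when the
                // path has no "/" or ends with one.
                String last = (idx == -1) ? "" : path.substring(idx + 1);
                target = last + target;
            }
            return new URL(base, target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0";
        // Pure query target: Search.aspx is kept in the result.
        System.out.println(resolve(base, "?co=0&sk=0&p=2&pi=1"));
        // Ordinary relative target: passed to the URL class unchanged.
        System.out.println(resolve(base, "Other.aspx"));
    }
}
```

Targets that do not start with "?" take the same path through the URL class as 
before, so only pure query links are affected by the rewrite.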

If it is agreed that this solution works, I believe the other HTML parsers in 
Nutch would need to be modified in a similar way.

Can I get feedback on this proposed solution?  Specifically I'm worried about 
unforeseen side effects.

Much thanks

Here is the patch info:
Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
===================================================================
--- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java    (revision 916362)
+++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java    (working copy)
@@ -299,6 +299,50 @@
     return false;
   }
   
+  private URL fixURL(URL base, String target) throws MalformedURLException
+  {
+         // handle params that are embedded into the base url - move them to target
+         // so URL class constructs the new url class properly
+         if  (base.toString().indexOf(';') > 0)  
+          return fixEmbeddedParams(base, target);
+         
+         // handle the case that there is a target that is a pure query.
+         // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
+         // URLs but I've seen this in numerous places, for example at
+         // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
+         // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
+         // URL constructs the base+target combo as
+         // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
+         // dropping the Search.aspx target
+         //
+         // Browsers handle these just fine, they must have an exception similar to this
+         if (target.startsWith("?"))
+         {
+                 return fixPureQueryTargets(base, target);
+         }
+         
+         return new URL(base, target);
+  }
+  
+  private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
+  {
+       if (!target.startsWith("?"))
+               return new URL(base, target);
+
+       String basePath = base.getPath();
+       String baseRightMost="";
+       int baseRightMostIdx = basePath.lastIndexOf("/");
+       if (baseRightMostIdx != -1)
+       {
+               baseRightMost = basePath.substring(baseRightMostIdx+1);
+       }
+       
+       if (target.startsWith("?"))
+               target = baseRightMost+target;
+       
+       return new URL(base, target);
+  }
+
   /**
    * Handles cases where the url param information is encoded into the base
    * url as opposed to the target.
@@ -400,8 +444,7 @@
             if (target != null && !noFollow && !post)
               try {
                 
-                URL url = (base.toString().indexOf(';') > 0) ? 
-                  fixEmbeddedParams(base, target) :  new URL(base, target);
+                URL url = fixURL(base, target);
                 outlinks.add(new Outlink(url.toString(),
                                          linkText.toString().trim()));
               } catch (MalformedURLException e) {


