[
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reassigned NUTCH-797:
-------------------------------------
Assignee: Sebastian Nagel (was: Julien Nioche)
> URL not properly constructed when link target begins with a "?"
> ---------------------------------------------------------------
>
> Key: NUTCH-797
> URL: https://issues.apache.org/jira/browse/NUTCH-797
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.1, nutchgora
> Environment: Win 7, Java(TM) SE Runtime Environment (build
> 1.6.0_16-b01)
> Also repro's on RHEL and java 1.4.2
> Reporter: Robert Hohman
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-797-2x-v2.patch, NUTCH-797-2x.patch,
> NUTCH-797.patch, pureQueryUrl-2.patch, pureQueryUrl.patch, test_nutch_797.html
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided
> enough detail.
> In crawling the page at
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are
> links in the page that look like this:
> <a href="?co=0&sk=0&p=2&pi=1">2</a></td><td><a href="?co=0&sk=0&p=3&pi=1">3</a>
> In org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as
> getOutlinks looks for links, it comes across this link and constructs a new
> URL from a base URL object built from
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0" and a
> target of "?co=0&sk=0&p=2&pi=1".
> The URL class merges these two the way the older RFC 2396 does (see the
> "?y" example in its Appendix C.1): the rightmost path segment of the base
> (the Search.aspx in this case) is ripped off before combining, which gives:
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> RFC 3986 (section 5.2.2; see also
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge) changed
> this: when the reference has an empty path, the base path is kept and only
> the query is replaced.
> So the URLs which are created for the next round of fetching are incorrect.
> Modern browsers handle this case the RFC 3986 way (I checked IE8 and
> Firefox 3.5), so a link that is only a query is valid and not a poorly
> formed URL on accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ?
> begins the target, and then pulling the rightmost component out of the base
> and inserting it into the target before the ?, so the target in this example
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other html parsers in
> nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution? Specifically I'm worried about
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===================================================================
> --- src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (revision 916362)
> +++ src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java (working copy)
> @@ -299,6 +299,50 @@
> return false;
> }
>
> + private URL fixURL(URL base, String target) throws MalformedURLException
> + {
> + // handle params that are embedded into the base url - move them to target
> + // so URL class constructs the new url class properly
> + if (base.toString().indexOf(';') > 0)
> + return fixEmbeddedParams(base, target);
> +
> + // handle the case that there is a target that is a pure query.
> + // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on how to assemble
> + // URLs but I've seen this in numerous places, for example at
> + // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> + // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by default
> + // URL constructs the base+target combo as
> + // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, incorrectly
> + // dropping the Search.aspx target
> + //
> + // Browsers handle these just fine, they must have an exception similar to this
> + if (target.startsWith("?"))
> + {
> + return fixPureQueryTargets(base, target);
> + }
> +
> + return new URL(base, target);
> + }
> +
> + private URL fixPureQueryTargets(URL base, String target) throws MalformedURLException
> + {
> + if (!target.startsWith("?"))
> + return new URL(base, target);
> +
> + String basePath = base.getPath();
> + String baseRightMost="";
> + int baseRightMostIdx = basePath.lastIndexOf("/");
> + if (baseRightMostIdx != -1)
> + {
> + baseRightMost = basePath.substring(baseRightMostIdx+1);
> + }
> +
> + if (target.startsWith("?"))
> + target = baseRightMost+target;
> +
> + return new URL(base, target);
> + }
> +
> /**
> * Handles cases where the url param information is encoded into the base
> * url as opposed to the target.
> @@ -400,8 +444,7 @@
> if (target != null && !noFollow && !post)
> try {
>
> - URL url = (base.toString().indexOf(';') > 0) ?
> - fixEmbeddedParams(base, target) : new URL(base, target);
> + URL url = fixURL(base, target);
> outlinks.add(new Outlink(url.toString(),
> linkText.toString().trim()));
> } catch (MalformedURLException e) {
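
The rewrite described in the report can be tried outside of Nutch with a small standalone sketch like the one below. It is not the patch itself: the class name PureQueryTargetSketch and the helper resolve are made up for illustration, and the sketch only covers the pure-query case (it ignores the embedded ';' params that fixEmbeddedParams handles in the patch).

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PureQueryTargetSketch {

    // Hypothetical helper, not a Nutch method: if the target is a pure
    // query such as "?a=b", prepend the last path segment of the base so
    // that java.net.URL keeps that segment when resolving.
    public static URL resolve(URL base, String target) throws MalformedURLException {
        if (target.startsWith("?")) {
            String basePath = base.getPath();
            int lastSlash = basePath.lastIndexOf('/');
            // substring(lastSlash + 1) is also safe when there is no slash (-1)
            String lastSegment = basePath.substring(lastSlash + 1);
            target = lastSegment + target;
        }
        return new URL(base, target);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL(
            "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0");
        // The pure-query target from the report
        System.out.println(resolve(base, "?co=0&sk=0&p=2&pi=1"));
    }
}
```

For the example from the report this prints http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1. Targets that do not start with "?" are passed to the URL constructor unchanged.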
--
This message was sent by Atlassian JIRA
(v6.2#6252)