[ https://issues.apache.org/jira/browse/NUTCH-2554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430542#comment-16430542 ]
Gerard Bouchar commented on NUTCH-2554:
---------------------------------------

Yes, it seems that the PR fixed the issue, thank you!

> parserchecker can't fetch some URLs
> -----------------------------------
>
>                 Key: NUTCH-2554
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2554
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> The parserchecker (org.apache.nutch.parse.ParserChecker) calls
> URLUtil.toASCII on the URL it is given, re-encoding already percent-encoded
> URLs.
> For instance, let's say we want to query http://example.com/, passing a GET
> parameter with name 'q' and value '/'. '/' is a special character, and thus
> has to be encoded before being sent.
> If we pass 'http://example.com/?q=/' to the parserchecker, then it doesn't
> encode the '/', and tries to fetch the URL as is, which is invalid.
> If we try to encode the parameter beforehand, and call the parserchecker with
> 'http://example.com/?q=%2F', then it encodes the '%' sign to '%25', and thus
> fetches 'http://example.com/?q=%252F'.
> This actually makes it impossible to fetch the correct URL
> (http://example.com/?q=%2F) from the parserchecker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
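
For readers who want to reproduce the double encoding described in the quoted
issue, here is a minimal sketch using only the JDK. It assumes the re-encoding
comes from rebuilding the URL through the multi-argument java.net.URI
constructor, which always quotes '%'. That is an assumption: the issue does not
show the source of URLUtil.toASCII. The class DoubleEncodingDemo and the method
reencode are hypothetical names, not Nutch code.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

public class DoubleEncodingDemo {

    // Hypothetical helper (not Nutch code): splits a URL into its raw
    // components and rebuilds it with the multi-argument java.net.URI
    // constructor. That constructor always quotes '%', so an existing
    // escape such as %2F becomes %252F.
    public static String reencode(String url) throws URISyntaxException {
        URI parsed = URI.create(url);
        URI rebuilt = new URI(
                parsed.getScheme(),
                parsed.getRawUserInfo(),
                IDN.toASCII(parsed.getHost()), // punycode the host name
                parsed.getPort(),
                parsed.getRawPath(),
                parsed.getRawQuery(),
                parsed.getRawFragment());
        return rebuilt.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // '/' is left alone in the query component.
        System.out.println(reencode("http://example.com/?q=/"));
        // prints http://example.com/?q=/

        // The '%' of an existing escape is quoted again.
        System.out.println(reencode("http://example.com/?q=%2F"));
        // prints http://example.com/?q=%252F
    }
}
```

Under this assumption, neither input form reaches the server as
http://example.com/?q=%2F, which matches the behavior reported in the issue.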