[ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405235#comment-13405235 ]
Markus Jelsma commented on NUTCH-1418:
--------------------------------------
Indeed, there is no problem crawling Wikipedia. The warning itself is fine, but
the undecoded path is still being added to the rule set. Perhaps the path should
be skipped instead: if it cannot be decoded, there is no need to store it in the
rule set, is there?
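
To make the suggestion concrete, here is a minimal sketch of skipping a path
that cannot be decoded. It assumes the parser decodes with java.net.URLDecoder;
the class and method names (RobotsPathSketch, decodePath, buildRuleSet) are
hypothetical and this is not the actual RobotRulesParser code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not taken from the Nutch source.
final class RobotsPathSketch {

    // Returns the decoded path, or empty when the percent-escapes are malformed.
    static Optional<String> decodePath(String rawPath) {
        try {
            return Optional.of(URLDecoder.decode(rawPath, "UTF-8"));
        } catch (IllegalArgumentException e) {
            // Thrown for escapes such as "%3M", where 'M' is not a hex digit.
            return Optional.empty();
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Builds the rule set, warning about undecodable paths and leaving them out.
    static List<String> buildRuleSet(List<String> rawPaths) {
        List<String> rules = new ArrayList<>();
        for (String raw : rawPaths) {
            Optional<String> decoded = decodePath(raw);
            if (decoded.isPresent()) {
                rules.add(decoded.get());
            } else {
                System.err.println("WARN skipping undecodable robots path: " + raw);
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "/wiki/Wikipedia%3Mediation_Committee/",   // malformed escape
            "/wiki/Wikipedia%3AMediation_Committee/"); // well-formed escape
        System.out.println(buildRuleSet(raw));
    }
}
```

With this approach the warning is still logged, but only the well-formed path
ends up in the rule set. Whether dropping a Disallow line is the right policy
for a crawler is a separate question that the sketch does not settle.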
> error parsing robots rules- can't decode path:
> /wiki/Wikipedia%3Mediation_Committee/
> ------------------------------------------------------------------------------------
>
> Key: NUTCH-1418
> URL: https://issues.apache.org/jira/browse/NUTCH-1418
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Arijit Mukherjee
>
> Since learning that Nutch will be unable to crawl the JavaScript function
> calls in href, I started looking for other alternatives. I decided to crawl
> http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and following the step-by-step approach
> up to the fetcher, when I realized that Nutch did not fetch anything from this
> website. I looked into logs/hadoop.log and found the following 3 lines,
> which I believe could be saying that Nutch is unable to parse the
> robots.txt of the website and that the fetcher therefore stopped?
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>
> I tried checking the URL using parsechecker and found no issues there! I think
> this means that the robots.txt is malformed for this website, which is
> preventing the fetcher from fetching anything. Is there a way to get around
> this problem, given that parsechecker seems to go on its merry way parsing?