[ 
https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1418:
----------------------------------------

    Fix Version/s: 1.7
    
> error parsing robots rules- can't decode path: 
> /wiki/Wikipedia%3Mediation_Committee/
> ------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1418
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1418
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Arijit Mukherjee
>             Fix For: 1.7
>
>
> Since learning that Nutch cannot crawl JavaScript function calls in href 
> attributes, I started looking for alternatives and decided to crawl 
> http://en.wikipedia.org/wiki/Districts_of_India.
>     I first injected this URL and followed the step-by-step approach up to 
> the fetcher, at which point I realized that Nutch had not fetched anything 
> from this website. I looked into logs/hadoop.log and found the following 3 
> lines. Could they mean that Nutch is unable to parse the website's 
> robots.txt and that the fetcher therefore stopped?
>    
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>     I tried checking the URL using parsechecker and found no issues there. I 
> think this means that the robots.txt for this website is malformed, which is 
> preventing the fetcher from fetching anything. Is there a way to get around 
> this problem, given that parsechecker parses the page without complaint?
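
For illustration only: the three paths in the warnings contain the sequence
"%3M", which is not a valid percent-escape because "M" is not a hexadecimal
digit. The sketch below is not Nutch's RobotRulesParser code. It only shows,
with the standard java.net.URLDecoder, why a strict decoder rejects such a
path, and one possible tolerant fallback. The class and method names are
hypothetical, and treating a stray "%" as a literal character is an
assumption, not the fix adopted for this issue.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of Nutch.
public class RobotsPathDecodeSketch {

    /** Strict decode: returns null when the path holds an invalid escape. */
    static String strictDecode(String path) {
        try {
            // Note: URLDecoder also turns '+' into a space (form decoding).
            return URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // "%3M" ends up here: 'M' is not a hex digit.
            return null;
        }
    }

    /** ASCII-only hex digit check. */
    static boolean isHex(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    /**
     * Tolerant decode (assumption): a '%' that is not followed by two hex
     * digits is kept as a literal '%' by rewriting it to "%25" first.
     */
    static String lenientDecode(String path) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            boolean validEscape = i + 2 < path.length()
                && isHex(path.charAt(i + 1))
                && isHex(path.charAt(i + 2));
            if (c == '%' && !validEscape) {
                sb.append("%25");
            } else {
                sb.append(c);
            }
        }
        return strictDecode(sb.toString());
    }

    public static void main(String[] args) {
        String path = "/wiki/Wikipedia%3Mediation_Committee/";
        System.out.println("strict:  " + strictDecode(path));
        System.out.println("lenient: " + lenientDecode(path));
    }
}
```

With the path from the first warning, strictDecode returns null, while
lenientDecode returns the path unchanged. A well-formed escape such as "%3A"
is still decoded (to ":") by both methods.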

