Arijit Mukherjee created NUTCH-1418:
---------------------------------------

             Summary: error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
                 Key: NUTCH-1418
                 URL: https://issues.apache.org/jira/browse/NUTCH-1418
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.4
            Reporter: Arijit Mukherjee


Since learning that Nutch is unable to crawl JavaScript function calls in href attributes, I started looking for alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
    I first injected this URL and followed the step-by-step approach up to the fetcher, when I realized that Nutch had fetched nothing from this website. I looked into logs/hadoop.log and found the following 3 lines, which I believe could mean that Nutch is unable to parse the site's robots.txt and that the fetcher therefore stopped:
   
    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
    2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
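
    These escapes look malformed at the robots.txt level: in a percent-escape, '%' must be followed by two hexadecimal digits (e.g. %3A for ':'), but each of these paths has '%3' followed by 'M', and 'M' is not a hex digit, so a strict URL decoder rejects the whole path. A minimal sketch of the failure, assuming a standard java.net.URLDecoder (the decoding inside RobotRulesParser may differ):

    import java.net.URLDecoder;

    public class DecodeCheck {
        public static void main(String[] args) throws Exception {
            // "%3A" is a valid escape (a colon) and decodes fine:
            System.out.println(URLDecoder.decode(
                "/wiki/Wikipedia%3AMediation_Committee/", "UTF-8"));
            // "%3M" is not a valid escape ('M' is not a hex digit); this call
            // throws IllegalArgumentException: Illegal hex characters in escape (%) pattern
            System.out.println(URLDecoder.decode(
                "/wiki/Wikipedia%3Mediation_Committee/", "UTF-8"));
        }
    }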

    I checked the URL with parsechecker and found no issues there. I think this means the robots.txt for this website is malformed, which is preventing the fetcher from fetching anything, while parsechecker goes on its merry way parsing. Is there a way to get around this problem?
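
    One possible workaround, if patching Nutch locally is an option, would be a lenient decoder that copies malformed escapes through verbatim (or skips just the offending rule) instead of giving up on the whole robots.txt. This is a hypothetical sketch, not the actual RobotRulesParser code; lenientDecode and isHex are names made up for illustration:

    public class LenientDecode {
        // Hypothetical lenient percent-decoder: valid escapes are decoded,
        // malformed ones (like "%3M") are kept as-is instead of throwing.
        public static String lenientDecode(String path) {
            StringBuilder out = new StringBuilder(path.length());
            for (int i = 0; i < path.length(); i++) {
                char c = path.charAt(i);
                if (c == '%' && i + 2 < path.length()
                        && isHex(path.charAt(i + 1)) && isHex(path.charAt(i + 2))) {
                    // single-byte decode; sufficient for ASCII escapes like %3A
                    out.append((char) Integer.parseInt(path.substring(i + 1, i + 3), 16));
                    i += 2;
                } else {
                    out.append(c);  // copy malformed escapes through untouched
                }
            }
            return out.toString();
        }

        private static boolean isHex(char c) {
            return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
                    || (c >= 'A' && c <= 'F');
        }

        public static void main(String[] args) {
            // malformed %3M survives verbatim instead of aborting:
            System.out.println(lenientDecode("/wiki/Wikipedia%3Mediation_Committee/"));
            // valid %3A still decodes to ':':
            System.out.println(lenientDecode("/wiki/Wikipedia%3AMediation_Committee/"));
        }
    }

    Decoding byte by byte like this is only correct for ASCII escapes, but it would at least keep one bad rule from stopping the entire fetch.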


        
