[
https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tejas Patil resolved NUTCH-1418.
--------------------------------
Resolution: Fixed
Fix Version/s: 2.2
After robots.txt handling was delegated to crawler-commons (NUTCH-1031),
this issue is NOT reproducible.
The URL in question gets crawled:
{noformat}
http://en.wikipedia.org/wiki/Districts_of_India Version: 7
Status: 2 (db_fetched)
Fetch time: Tue Jun 11 04:47:14 PDT 2013
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.4599998
Signature: b0ec6daf534d9d28f3b49ad7915af89c
Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
{noformat}
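For reference, a minimal sketch of how the delegated parsing can be exercised, assuming the crawler-commons SimpleRobotRulesParser API whose parseContent takes a single comma-separated robot-names string; the class name RobotsCheck and the agent name are illustrative:
{code:java}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        // A fragment like the one that tripped the old parser; the "%3M"
        // escape is malformed, but crawler-commons keeps the path as-is
        // instead of failing.
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /wiki/Wikipedia%3Mediation_Committee/\n"
                + "Disallow: /wiki/Wikipedia_talk%3Mediation_Committee/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://en.wikipedia.org/robots.txt",
                robotsTxt.getBytes("UTF-8"),
                "text/plain",
                "nutch-test");  // agent name is illustrative

        // The malformed Disallow lines no longer abort parsing, so
        // unrelated pages stay fetchable.
        System.out.println(
                rules.isAllowed("http://en.wikipedia.org/wiki/Districts_of_India"));
    }
}
{code}
Because crawler-commons tolerates the malformed Disallow paths instead of rejecting the whole rule set, pages such as the one dumped above remain fetchable.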
> error parsing robots rules- can't decode path:
> /wiki/Wikipedia%3Mediation_Committee/
> ------------------------------------------------------------------------------------
>
> Key: NUTCH-1418
> URL: https://issues.apache.org/jira/browse/NUTCH-1418
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Arijit Mukherjee
> Fix For: 1.7, 2.2
>
>
> Since learning that Nutch is unable to crawl JavaScript function
> calls in href attributes, I started looking for alternatives. I decided to crawl
> http://en.wikipedia.org/wiki/Districts_of_India.
> I first tried injecting this URL and following the step-by-step approach
> up to the fetcher, when I realized that Nutch did not fetch anything from this
> website. I looked into logs/hadoop.log and found the following 3 lines,
> which I believe could mean that Nutch is unable to parse the
> robots.txt on the website and that the fetcher therefore stopped?
>
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
> 2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots
> rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
> I tried checking the URL using parsechecker and found no issues there! I think
> this means that the robots.txt for this website is malformed, which is
> preventing the fetcher from fetching anything. Is there a way to get around this
> problem, given that parsechecker goes on its merry way parsing?
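For context on the quoted warnings: the Disallow paths contain the malformed escape "%3M" (a '%' must be followed by two hex digits, and 'M' is not one). The pre-NUTCH-1031 parser appears to decode such paths with java.net.URLDecoder, which rejects incomplete escapes; a minimal sketch of that failure mode, with the class name DecodeCheck as an illustrative assumption:
{code:java}
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "%3M" is a broken percent-escape: '%' must be followed by two
        // hex digits, and 'M' is not a hex digit.
        String path = "/wiki/Wikipedia%3Mediation_Committee/";
        try {
            URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException e) {
            // Mirrors the quoted warning: the old parser caught this and
            // logged "error parsing robots rules- can't decode path: ..."
            System.err.println("can't decode path: " + path
                    + " (" + e.getMessage() + ")");
        }
    }
}
{code}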
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira