[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487318#comment-13487318 ]
Sebastian Nagel commented on NUTCH-578:
---------------------------------------

Resetting the retry counter in setPageGoneSchedule has some disadvantages:
* the information is lost that the db_gone results from a number of unsuccessful fetches due to transient errors
* maybe you do not want to "get again 3 trials after db.max.fetch.interval is reached". If a page has been fetched 3 times in a row with a 403 and we try again after one month and get a 403 again, we do not need 3 trials any more.

> URL fetched with 403 is generated over and over again
> -----------------------------------------------------
>
>                 Key: NUTCH-578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-578
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007
>            Reporter: Nathaniel Powell
>            Assignee: Markus Jelsma
>             Fix For: 1.6
>
>         Attachments: crawl-urlfilter.txt, NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, NUTCH-578_v5.patch, nutch-site.xml, regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL which is on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again for almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
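
To make the second point of the comment concrete, here is a minimal Java sketch of the trade-off. It is not Nutch code: the class RetrySketch, the Page type, and the methods recordFailure, recheck and failuresUntilGone are all hypothetical names invented for illustration, and the rule "mark the page gone once the counter reaches the maximum" is an assumption about the intended behaviour, not a description of what CrawlDbReducer or setPageGoneSchedule actually do. RETRY_MAX mirrors the db.fetch.retry.max value of 3 quoted in the report.

```java
// Hypothetical sketch only; none of these names exist in Nutch.
public class RetrySketch {
    enum Status { UNFETCHED, GONE }

    // Mirrors db.fetch.retry.max = 3 from the report (assumed meaning).
    static final int RETRY_MAX = 3;

    static final class Page {
        Status status = Status.UNFETCHED;
        int retries = 0;
    }

    /** Record one failed fetch, e.g. an HTTP 403 (assumed rule). */
    static void recordFailure(Page p) {
        p.retries++;
        if (p.retries >= RETRY_MAX) {
            p.status = Status.GONE;
        }
    }

    /** Re-queue a gone page once the maximum fetch interval has elapsed. */
    static void recheck(Page p, boolean resetCounter) {
        p.status = Status.UNFETCHED;
        if (resetCounter) {
            p.retries = 0; // the history of earlier failures is lost here
        }
    }

    /** Number of failed fetches needed after the re-check until the page is gone again. */
    static int failuresUntilGone(boolean resetCounter) {
        Page p = new Page();
        while (p.status != Status.GONE) {
            recordFailure(p); // initial run of failures
        }
        recheck(p, resetCounter);
        int failures = 0;
        while (p.status != Status.GONE) {
            recordFailure(p);
            failures++;
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("counter reset: " + failuresUntilGone(true));
        System.out.println("counter kept:  " + failuresUntilGone(false));
    }
}
```

Under these assumptions, resetting the counter costs three further failed fetches before the page is gone again, while keeping it costs one, and the kept counter still records that the page went gone because of repeated failures.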