Two days ago I posted the message below to the nutch-user list. Since nobody has answered yet, I think this is more of a developer issue than a user issue (to me it looks like a bug). I would like to discuss it with a Nutch developer. Thanks!
----------------------------------------------
Hello,

just a few days ago we started to use Nutch (0.7.1). It's really nice and I would like to see it evolve. Here's my issue/question:

While fetching our URLs, we got some errors like this:

    60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed with:
    java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.

That seems to be OK and indicates some network problems. The problem is that the entry in the WebDB shows the following:

    Page 4: Version: 4
    URL: http://www.test-domain.de/crawl_html/page_2.html
    ID: b360ec931855b0420776909bd96557c0
    Next fetch: Sun Aug 17 07:12:55 CET 292278994
    Retries since fetch: 0
    Retry interval: 0 days

The 'Next fetch' date is set to the year 292278994. I probably won't live to see the refetch. ;) What's wrong here? I hope it's not my lifespan. ;) A page that couldn't be crawled because of network problems should be refetched with the next crawl (i.e. its next fetch date should be set to the next day).

I'm just using the standard API of Nutch 0.7.1, like:

    WebDBWriter webdb = new WebDBWriter(fileSystem, new File(dbPath));
    UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, true, -1);
    tool.updateForSegment(fileSystem, lseg);
    tool.close();

Thanks,
mos
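PS: For what it's worth, that odd date is exactly what java.util.Date prints for Long.MAX_VALUE milliseconds, so it looks as if the updater is storing Long.MAX_VALUE as the next-fetch time of RetryLater pages instead of a real retry date. A quick check (assuming the stored value is milliseconds since the epoch):

    import java.util.Date;

    public class NextFetchCheck {
        public static void main(String[] args) {
            // Long.MAX_VALUE ms after the epoch falls in the year 292278994,
            // which matches the 'Next fetch' date in the WebDB dump above
            // (the exact string printed depends on the local time zone).
            System.out.println(new Date(Long.MAX_VALUE));
        }
    }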

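PPS: Until this is fixed properly, a repair pass over the WebDB along the following lines might serve as a workaround, resetting the absurd next-fetch dates to "tomorrow" so those pages are retried on the next crawl. This is only a sketch under assumptions I have not verified against the 0.7.1 sources: that NutchFileSystem.get() works without arguments, that Page exposes getNextFetchTime()/setNextFetchTime(), and that WebDBWriter.addPage() overwrites an existing entry for the same URL.

    import java.io.File;
    import java.util.Enumeration;
    import org.apache.nutch.db.Page;
    import org.apache.nutch.db.WebDBReader;
    import org.apache.nutch.db.WebDBWriter;
    import org.apache.nutch.fs.NutchFileSystem;

    public class ResetRetryLaterPages {
        public static void main(String[] args) throws Exception {
            NutchFileSystem fileSystem = NutchFileSystem.get(); // unverified: no-arg get()
            File dbDir = new File(args[0]);
            // Schedule the retry for 24 hours from now.
            long tomorrow = System.currentTimeMillis() + 24L * 60 * 60 * 1000;
            WebDBReader reader = new WebDBReader(fileSystem, dbDir);
            WebDBWriter writer = new WebDBWriter(fileSystem, dbDir);
            for (Enumeration e = reader.pages(); e.hasMoreElements();) {
                Page page = (Page) e.nextElement();
                // Only touch entries whose next fetch was pushed out to 'never'.
                if (page.getNextFetchTime() == Long.MAX_VALUE) {
                    page.setNextFetchTime(tomorrow);
                    writer.addPage(page); // assumed to replace the old entry
                }
            }
            reader.close();
            writer.close();
        }
    }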