Two days ago I posted the message below to the nutch-user list. Since nobody has answered yet, I think this is more of a developer issue than a user issue (to me it looks like a bug). I would like to discuss it with a Nutch developer. Thanks!
----------------------------------------------

Hello,

just a few days ago we started to use Nutch (0.7.1). It's really nice and I would like to see it evolve.

Here's my issue/question: while fetching our URLs, we got some errors like this:

    60202 154316 fetch of http://www.test-domain.de/crawl_html/page_2.html failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.

That seems to be OK and indicates some network problems. The problem is that the entry in the WebDB shows the following:

    Page 4: Version: 4
    URL: http://www.test-domain.de/crawl_html/page_2.html
    ID: b360ec931855b0420776909bd96557c0
    Next fetch: Sun Aug 17 07:12:55 CET 292278994
    Retries since fetch: 0
    Retry interval: 0 days

The 'Next fetch' date is set to the year '292278994'. I probably won't live to see the refetch. ;) What's wrong here? I hope it's not my lifespan. ;)

A page that couldn't be crawled because of network problems should be refetched with the next crawl (== set the next fetch date to the next day).

I'm just using the standard API of Nutch 0.7.1, like this:

    WebDBWriter webdb = new WebDBWriter(fileSystem, new File(dbPath));
    UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, true, -1);
    tool.updateForSegment(fileSystem, lseg);
    tool.close();

Thanks
mos

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
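
A note on the year 292278994 in the 'Next fetch' field: it is the year in which the largest possible Java long, read as milliseconds since the epoch, falls. So one possible explanation, which I have not verified against the Nutch source, is that the next fetch time ends up at or near Long.MAX_VALUE when the fetch fails. The sketch below only demonstrates the date arithmetic; the class name FarFutureDate and the method yearOfEpochMillis are made up for illustration and are not part of Nutch, and it uses java.time, so it needs Java 8 or later.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Date;

// Illustration only: not Nutch code. Shows which calendar year a
// millisecond timestamp of Long.MAX_VALUE corresponds to.
class FarFutureDate {

    // Returns the UTC calendar year of a timestamp given in
    // milliseconds since 1970-01-01T00:00:00Z.
    static int yearOfEpochMillis(long millis) {
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // Year of the largest representable millisecond timestamp: 292278994
        System.out.println(yearOfEpochMillis(Long.MAX_VALUE));

        // The same instant rendered by java.util.Date; the exact text
        // depends on the JVM's default time zone.
        System.out.println(new Date(Long.MAX_VALUE));
    }
}
```

If that guess is right, the page is effectively marked as "never fetch again" rather than "retry on the next crawl", which would match what I am seeing.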
