I'm testing nutch with a view to exhaustive scraping (using version 0.8). But I've got some sites that don't scrape and no idea why. Case in point is http://www.idc.com.
This is a HUGE site, but I get nothing in nutch. Hadoop.log reports:

2006-08-06 18:04:00,187 INFO fetcher.Fetcher - fetching http://www.idc.com/
2006-08-06 18:04:00,203 INFO http.Http - http.proxy.host = null
2006-08-06 18:04:00,203 INFO http.Http - http.proxy.port = 8080
2006-08-06 18:04:00,203 INFO http.Http - http.timeout = 10000
2006-08-06 18:04:00,203 INFO http.Http - http.content.limit = -1
2006-08-06 18:04:00,203 INFO http.Http - http.agent = IDCL/Nutch-0.8
2006-08-06 18:04:00,203 INFO http.Http - fetcher.server.delay = 500
2006-08-06 18:04:00,203 INFO http.Http - http.max.delays = 1000
2006-08-06 18:04:02,093 INFO fetcher.Fetcher - Fetcher: done

Readdb (-dump) reports:

http://www.idc.com/
Version: 4
Status: 3 (DB_gone)
Fetch time: Sun Aug 06 18:04:00 BST 2006
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

And segread reports:

Recno:: 0
URL:: http://www.idc.com/

CrawlDatum::
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Sun Aug 06 18:03:54 BST 2006
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

CrawlDatum::
Version: 4
Status: 7 (fetch_gone)
Fetch time: Sun Aug 06 18:04:00 BST 2006
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

Content::
url: http://www.idc.com/
base: http://www.idc.com/
contentType:
metadata: nutch.segment.name=20060806180357 nutch.crawl.score=1.0
Content:

I have been unable to find a definition of fetch_gone anywhere, so I'm not really sure what these logs mean. I have a baby scraper which I want to abandon in favour of nutch, and it has no problems (well, not different ones!) with this site. The page is there and there's nothing obviously odd about it, but it's not brought back and/or stored.

How can I diagnose this?

Iain
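
P.S. One check that can be made outside nutch is to request the page directly with the same agent string and see what status the server sends back. The following is only a minimal sketch in Python, not nutch code: the names build_request and probe are made up for illustration, and the agent string and timeout are simply copied from the log above. It does not look at robots.txt, which nutch does, so a clean result here would not rule that out as a cause.

```python
import urllib.error
import urllib.request

# Values copied from the hadoop.log excerpt above (assumed to be what nutch sends).
AGENT = "IDCL/Nutch-0.8"   # http.agent
TIMEOUT_S = 10             # http.timeout = 10000 ms


def build_request(url, agent=AGENT):
    """Build a GET request carrying the crawler's User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": agent})


def probe(url, agent=AGENT, timeout=TIMEOUT_S):
    """Fetch url and return (status, url, content_type).

    urllib follows redirects itself, so on success the returned URL is the
    final one. HTTP error statuses (4xx/5xx) are returned rather than raised,
    with the URL the error was reported for. Network-level failures (DNS,
    timeout) still raise urllib.error.URLError.
    """
    try:
        with urllib.request.urlopen(build_request(url, agent), timeout=timeout) as resp:
            return resp.status, resp.geturl(), resp.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        # err.filename holds the URL the error was raised for.
        return err.code, err.filename, err.headers.get("Content-Type")
```

Calling probe("http://www.idc.com/") once with the nutch agent and once with a browser-like agent string would show whether the server answers the two differently.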
