I'm testing Nutch 0.8 with a view to exhaustive scraping.

But some sites don't get scraped and I have no idea why.  A case in point is
http://www.idc.com.

This is a HUGE site, but Nutch brings back nothing from it.

Hadoop.log reports
2006-08-06 18:04:00,187 INFO  fetcher.Fetcher - fetching http://www.idc.com/
2006-08-06 18:04:00,203 INFO  http.Http - http.proxy.host = null
2006-08-06 18:04:00,203 INFO  http.Http - http.proxy.port = 8080
2006-08-06 18:04:00,203 INFO  http.Http - http.timeout = 10000
2006-08-06 18:04:00,203 INFO  http.Http - http.content.limit = -1
2006-08-06 18:04:00,203 INFO  http.Http - http.agent = IDCL/Nutch-0.8
2006-08-06 18:04:00,203 INFO  http.Http - fetcher.server.delay = 500
2006-08-06 18:04:00,203 INFO  http.Http - http.max.delays = 1000
2006-08-06 18:04:02,093 INFO  fetcher.Fetcher - Fetcher: done

Readdb (-dump) reports
http://www.idc.com/     Version: 4
Status: 3 (DB_gone)
Fetch time: Sun Aug 06 18:04:00 BST 2006
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

And segread reports

Recno:: 0
URL:: http://www.idc.com/

CrawlDatum::
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Sun Aug 06 18:03:54 BST 2006
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

CrawlDatum::
Version: 4
Status: 7 (fetch_gone)
Fetch time: Sun Aug 06 18:04:00 BST 2006
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: null

Content::
url: http://www.idc.com/
base: http://www.idc.com/
contentType: 
metadata: nutch.segment.name=20060806180357 nutch.crawl.score=1.0 
Content:

I have been unable to find a definition of fetch_gone anywhere, so I'm not
really sure what these logs mean.

I have a baby scraper, which I want to abandon in favour of Nutch, and it
has no problems (well, not different ones!) with this site.

The page is there, and there's nothing obviously odd about it, but it's not
brought back or stored.

How can I diagnose this?
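
One check that can be run outside Nutch is whether the site's robots.txt
excludes the crawler's agent name. That is only one possible reason for a
fetch to be recorded as gone, and nothing in the logs above confirms it. A
minimal sketch, assuming Python's standard urllib.robotparser. The robots.txt
text below is made up for illustration and is NOT the real file from
www.idc.com, and "IDCL" is assumed to be the agent name taken from the
http.agent value in the log:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt content, NOT the real one served by www.idc.com.
SAMPLE_ROBOTS = """\
User-agent: IDCL
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    # "IDCL" is an assumed agent name; "SomeOtherBot" is a placeholder.
    for agent in ("IDCL", "SomeOtherBot"):
        allowed = is_allowed(SAMPLE_ROBOTS, agent, "http://www.idc.com/")
        print(agent, "allowed" if allowed else "blocked")
```

To test the live site, the same function can be given the text actually
served at http://www.idc.com/robots.txt in place of SAMPLE_ROBOTS.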

Iain 

