Dear Nutch users,

I am currently using Nutch 0.9 to crawl some local websites from my region, and
my fetch process just failed with the following error:

Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)          

The updatedb step (which comes after the fetch) also failed:

CrawlDb update: starting
CrawlDb update: db: /data/02/nutch/crawl/crawldb
CrawlDb update: segments: [/data/02/nutch/crawl/segments/20071023000503]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment /data/02/nutch/crawl/segments/20071023000503
CrawlDb update: Merging segment data into db.                          


I just noticed that Nutch (during the fetch) uses my /tmp directory to store some
temporary data in /tmp/hadoop-nutch, which filled my /tmp partition to 100%,
so I guess this is the problem. Regarding this, I have three quick questions:

- What exactly is Nutch storing in /tmp/hadoop-nutch?
- How can I force Nutch to use another directory (one with more space)? I have
sketched a guess below.
- How can I recover my fetched sites and continue the process without losing
all the work the fetcher already did up to the point where it stopped?
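
For the second question, my best guess so far is Hadoop's hadoop.tmp.dir
property; its default seems to be /tmp/hadoop-${user.name}, which would explain
the /tmp/hadoop-nutch path I am seeing. I assume it can be overridden in
conf/hadoop-site.xml with something like the following (the /data/02/nutch/tmp
path is just an example location on my bigger partition), but please correct me
if Nutch expects it somewhere else:

    <?xml version="1.0"?>
    <!-- conf/hadoop-site.xml: site-specific overrides of hadoop-default.xml -->
    <configuration>
      <property>
        <!-- assumption: this controls where the /tmp/hadoop-<user> scratch dir goes -->
        <name>hadoop.tmp.dir</name>
        <value>/data/02/nutch/tmp</value>
      </property>
    </configuration>

And for the third question, once space is freed up I was planning to simply
re-run the updatedb step by hand against the failed segment, along the lines of:

    bin/nutch updatedb /data/02/nutch/crawl/crawldb /data/02/nutch/crawl/segments/20071023000503

but since the log above says the segment is invalid, I am not sure whether the
data the fetcher already wrote is still usable.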

Many thanks in advance for your help.

Best regards

