Dear Nutch users,
I am currently using Nutch 0.9 to crawl some local websites from the region,
and my fetch process just failed with the following error:
Fetcher: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
The updatedb step (which runs after the fetch) also failed:
CrawlDb update: starting
CrawlDb update: db: /data/02/nutch/crawl/crawldb
CrawlDb update: segments: [/data/02/nutch/crawl/segments/20071023000503]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
- skipping invalid segment /data/02/nutch/crawl/segments/20071023000503
CrawlDb update: Merging segment data into db.
I just noticed that the Nutch fetch step stores some temporary data in
/tmp/hadoop-nutch, which filled my /tmp partition to 100%, so I suspect this
is the cause of the failure. This leads me to three quick questions:
- What exactly is Nutch storing in /tmp/hadoop-nutch?
- How can I force Nutch to use another directory with more space? (My guess
at the relevant setting is sketched below.)
- How can I recover the pages already fetched and continue the process
without losing all the work the fetcher did up to the point where it stopped?
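
Regarding the second question, my best guess (unverified) is that this is
controlled by Hadoop's hadoop.tmp.dir property, since its default is
/tmp/hadoop-${user.name}, which would explain the /tmp/hadoop-nutch path I
am seeing. I imagine something like the following in conf/hadoop-site.xml
would work; the target path is just an example from my own disks:

  <configuration>
    <!-- Base directory for Hadoop's temporary files; pointing it at a
         partition with more free space than /tmp. -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/02/nutch/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>
  </configuration>

Can someone confirm whether this is the right property to set, and whether
the fetcher respects it?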
Many thanks in advance for your help.
Best regards,