Why did my crawl fail?

Paul Tomblin Fri, 24 Jul 2009 07:54:06 -0700

I installed Nutch 1.0 on my laptop last night and set it running to crawl my
blog with the command:

  bin/nutch crawl urls -dir crawl.blog -depth 10

It was still running strong when I went to bed several hours later, and this
morning I woke up to this:


activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.blog/crawldb
CrawlDb update: segments: [crawl.blog/segments/20090724010303]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.blog/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
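
The exception names one segment, 20090723154530, whose parse_data
directory does not exist. To see whether other segments are in the same
state, a rough Python sketch like the one below could list every segment
directory that lacks a parse_data subdirectory. This is not part of Nutch:
the function name segments_missing is made up for illustration, and the
default path assumes the crawl.blog/segments layout shown in the log above.

```python
import os
import sys


def segments_missing(segments_dir, required="parse_data"):
    """Return the sorted names of segment directories that have no
    `required` subdirectory (hypothetical helper, not a Nutch API)."""
    missing = []
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        # Only consider directories; skip any stray files.
        if not os.path.isdir(segment):
            continue
        if not os.path.isdir(os.path.join(segment, required)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed default: the -dir value from the crawl command plus "segments".
    segments_dir = sys.argv[1] if len(sys.argv) > 1 else "crawl.blog/segments"
    if os.path.isdir(segments_dir):
        for name in segments_missing(segments_dir):
            print(name)
```

Run from the nutch-1.0 directory, this prints one segment name per line.
It only reports which segments are incomplete; it does not explain why
parse_data was never written for them.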


-- 
http://www.linkedin.com/in/paultomblin
