I installed Nutch 1.0 on my laptop last night and set it running to crawl my blog with the command:

    bin/nutch crawl urls -dir crawl.blog -depth 10

It was still running strong when I went to bed several hours later, and this morning I woke up to this:

activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl.blog/crawldb
CrawlDb update: segments: [crawl.blog/segments/20090724010303]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl.blog/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)

--
http://www.linkedin.com/in/paultomblin
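
The exception names a missing parse_data directory in the first segment, so one way to see how many segments are affected is to list every segment that lacks that subdirectory. This is only a rough sketch: the function name list_incomplete_segments is made up for illustration, and the assumption that every usable segment contains a parse_data directory is inferred from the error message above, not taken from the Nutch documentation.

```shell
#!/bin/sh
# Hypothetical helper: print each segment directory that has no parse_data
# subdirectory. The expected layout (one directory per segment, each holding
# parse_data) is an assumption inferred from the InvalidInputException.
list_incomplete_segments() {
    segments_dir=$1
    for seg in "$segments_dir"/*/; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -d "$seg" ] || continue
        seg=${seg%/}
        if [ ! -d "$seg/parse_data" ]; then
            printf '%s\n' "$seg"
        fi
    done
    return 0
}

# Default to the crawl directory used in the command above.
list_incomplete_segments "${1:-crawl.blog/segments}"
```

Run from the nutch-1.0 directory, this only reads the directory tree and changes nothing; any path it prints is a segment LinkDb would presumably fail on in the same way.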
