This is a very interesting issue. I guess that the absence of parse_data means that no content was fetched for that segment. Am I wrong?

This has happened in my crawls a few times. Theoretically (I am guessing again), it can happen if every URL selected for fetching in an iteration is either blocked by the filters or fails to be fetched, for whatever reason. I got around the problem by checking for the presence of parse_data and, if it is absent, deleting the segment. This seems to work, but I am not 100% sure that it is a good thing to do. Can I do this? Is it safe? I would appreciate it if someone with expert knowledge commented on this issue.
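For what it's worth, here is a rough sketch of the check I run before the linkdb step. It assumes a local-filesystem crawl directory like the one in your log (the CRAWL_DIR variable is mine, not a Nutch convention); on HDFS you would use the hadoop fs equivalents instead of plain shell tests.

  #!/bin/sh
  # Delete any segment that has no parse_data subdirectory,
  # i.e. a segment in which nothing was actually fetched and parsed.
  CRAWL_DIR=crawl.blog                # adjust to your crawl directory
  for seg in "$CRAWL_DIR"/segments/*; do
    if [ ! -d "$seg/parse_data" ]; then
      echo "no parse_data in $seg - removing segment"
      rm -r "$seg"
    fi
  done

Run it after each fetch/parse cycle and the later LinkDb and index steps only ever see complete segments.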
Regards,
Arkadi

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
> Sent: Saturday, July 25, 2009 12:54 AM
> To: nutch-user
> Subject: Why did my crawl fail?
>
> I installed nutch 1.0 on my laptop last night and set it running to crawl my
> blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> it was still running strong when I went to bed several hours later, and this
> morning I woke up to this:
>
> activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl.blog/crawldb
> CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl.blog/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>
>
> --
> http://www.linkedin.com/in/paultomblin
