Sorry, I think you misunderstood me. I meant that no content was fetched on that iteration, i.e. for the segment that does not have parse_data.
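In case it helps, the check I mentioned is nothing fancier than looking for a parse_data subdirectory inside each segment and deleting the segment when it is missing, before LinkDb/indexing run over the segments. A rough sketch using the Hadoop FileSystem API (the CleanSegments class name and the command-line argument are just for illustration, and as I said before I am not 100% sure that deleting segments like this is completely safe):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  /** Deletes any segment under <crawlDir>/segments that has no parse_data. */
  public class CleanSegments {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);           // local fs or HDFS, depending on config
      Path segments = new Path(args[0], "segments");  // args[0] = crawl dir, e.g. crawl.blog
      for (FileStatus seg : fs.listStatus(segments)) {
        Path parseData = new Path(seg.getPath(), "parse_data");
        if (!fs.exists(parseData)) {
          // nothing was fetched/parsed on that iteration, so drop the segment
          System.out.println("deleting incomplete segment " + seg.getPath());
          fs.delete(seg.getPath(), true);             // true = recursive
        }
      }
    }
  }

The same thing can of course be done by hand: rm -rf any segment directory under crawl.blog/segments that has no parse_data, then re-run the linkdb/index steps.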
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
> Sent: Monday, July 27, 2009 11:12 AM
> To: [email protected]
> Subject: Re: Why did my crawl fail?
>
> No, it fetched thousands of pages - my blog and picture gallery. It just
> never finished indexing them because as well as looking at the 11 segments
> that exist, it's also trying to look at a segment that doesn't.
>
> On Sun, Jul 26, 2009 at 9:06 PM, <[email protected]> wrote:
>
> > This is a very interesting issue. I guess that absence of parse_data means
> > that no content has been fetched. Am I wrong?
> >
> > This happened in my crawls a few times. Theoretically (I am guessing again)
> > this may happen if all urls selected for fetching on this iteration are
> > either blocked by the filters, or failed to be fetched, for whatever reason.
> >
> > I got around this problem by checking for presence of parse_data, and if it
> > is absent, deleting the segment. This seems to be working, but I am not 100%
> > sure that this is a good thing to do. Can I do this? Is it safe to do? Would
> > appreciate if someone with expert knowledge commented on this issue.
> >
> > Regards,
> >
> > Arkadi
> >
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
> > > Sent: Saturday, July 25, 2009 12:54 AM
> > > To: nutch-user
> > > Subject: Why did my crawl fail?
> > >
> > > I installed nutch 1.0 on my laptop last night and set it running to crawl
> > > my blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> > > It was still running strong when I went to bed several hours later, and
> > > this morning I woke up to this:
> > >
> > > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=0
> > > Fetcher: done
> > > CrawlDb update: starting
> > > CrawlDb update: db: crawl.blog/crawldb
> > > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: done
> > > LinkDb: starting
> > > LinkDb: linkdb: crawl.blog/linkdb
> > > LinkDb: URL normalize: true
> > > LinkDb: URL filter: true
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> > > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> > > Input path does not exist:
> > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> > >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> > >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> > >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> > >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> > >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> > >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> > >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> > >     at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> > >
> > >
> > > --
> > > http://www.linkedin.com/in/paultomblin
>
>
> --
> http://www.linkedin.com/in/paultomblin
