Sorry, I think you misunderstood me. I meant that no content was fetched on that iteration, i.e. for the segment that does not have parse_data.
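In case it helps, the check I mentioned is nothing fancier than looking for a parse_data subdirectory inside each segment and deleting the segment when it is missing, before LinkDb/indexing run over the segments. A rough sketch using the Hadoop FileSystem API (the CleanSegments class name and the command-line argument are just for illustration, and as I said before I am not 100% sure that deleting segments like this is completely safe):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  /** Deletes any segment under <crawlDir>/segments that has no parse_data. */
  public class CleanSegments {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);           // local fs or HDFS, depending on config
      Path segments = new Path(args[0], "segments");  // args[0] = crawl dir, e.g. crawl.blog
      for (FileStatus seg : fs.listStatus(segments)) {
        Path parseData = new Path(seg.getPath(), "parse_data");
        if (!fs.exists(parseData)) {
          // nothing was fetched/parsed on that iteration, so drop the segment
          System.out.println("deleting incomplete segment " + seg.getPath());
          fs.delete(seg.getPath(), true);             // true = recursive
        }
      }
    }
  }

The same thing can of course be done by hand: rm -rf any segment directory under crawl.blog/segments that has no parse_data, then re-run the linkdb/index steps.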
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
> Sent: Monday, July 27, 2009 11:12 AM
> To: [email protected]
> Subject: Re: Why did my crawl fail?
>
> No, it fetched thousands of pages - my blog and picture gallery. It just
> never finished indexing them because as well as looking at the 11 segments
> that exist, it's also trying to look at a segment that doesn't.
>
> On Sun, Jul 26, 2009 at 9:06 PM, <[email protected]> wrote:
>
> > This is a very interesting issue. I guess that absence of parse_data means
> > that no content has been fetched. Am I wrong?
> >
> > This happened in my crawls a few times. Theoretically (I am guessing again)
> > this may happen if all urls selected for fetching on this iteration are
> > either blocked by the filters, or failed to be fetched, for whatever reason.
> >
> > I got around this problem by checking for presence of parse_data, and if it
> > is absent, deleting the segment. This seems to be working, but I am not 100%
> > sure that this is a good thing to do. Can I do this? Is it safe to do? Would
> > appreciate if someone with expert knowledge commented on this issue.
> >
> > Regards,
> >
> > Arkadi
> >
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
> > > Sent: Saturday, July 25, 2009 12:54 AM
> > > To: nutch-user
> > > Subject: Why did my crawl fail?
> > >
> > > I installed nutch 1.0 on my laptop last night and set it running to crawl
> > > my blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> > > It was still running strong when I went to bed several hours later, and
> > > this morning I woke up to this:
> > >
> > > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=0
> > > Fetcher: done
> > > CrawlDb update: starting
> > > CrawlDb update: db: crawl.blog/crawldb
> > > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: true
> > > CrawlDb update: URL filtering: true
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: done
> > > LinkDb: starting
> > > LinkDb: linkdb: crawl.blog/linkdb
> > > LinkDb: URL normalize: true
> > > LinkDb: URL filter: true
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> > > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> > > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> > > Input path does not exist:
> > > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> > >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> > >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> > >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> > >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> > >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> > >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> > >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> > >     at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> > >
> > >
> > > --
> > > http://www.linkedin.com/in/paultomblin
>
>
> --
> http://www.linkedin.com/in/paultomblin
