Unfortunately I blew away those particular logs when I fetched the svn trunk. I just tried it again (well, I started it again at noon and it just finished), and this time it worked fine, so it seems like a heisenbug. Maybe it has something to do with encountering pages of types it can't handle?
On Mon, Jul 27, 2009 at 11:27 AM, xiao yang <[email protected]> wrote:
> Hi, Paul
>
> Can you post the error messages in the log file
> (file:/Users/ptomblin/nutch-1.0/logs)?
>
> On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin <[email protected]> wrote:
> > Actually, I got that error the first time I used it, and then again when I
> > blew away the downloaded nutch and grabbed the latest trunk from Subversion.
> >
> > On Mon, Jul 27, 2009 at 1:11 AM, xiao yang <[email protected]> wrote:
> >
> >> You must have crawled for several times, and some of them failed
> >> before the parse phase. So the parse data was not generated.
> >> You'd better delete the whole directory
> >> file:/Users/ptomblin/nutch-1.0/crawl.blog, and recrawl it, then you
> >> will know the exact reason why it failed in the parse phase from the
> >> output information.
> >>
> >> Xiao
> >>
> >> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin <[email protected]> wrote:
> >> > I installed nutch 1.0 on my laptop last night and set it running to crawl my
> >> > blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
> >> > It was still running strong when I went to bed several hours later, and this
> >> > morning I woke up to this:
> >> >
> >> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> > -activeThreads=0
> >> > Fetcher: done
> >> > CrawlDb update: starting
> >> > CrawlDb update: db: crawl.blog/crawldb
> >> > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
> >> > CrawlDb update: additions allowed: true
> >> > CrawlDb update: URL normalizing: true
> >> > CrawlDb update: URL filtering: true
> >> > CrawlDb update: Merging segment data into db.
> >> > CrawlDb update: done
> >> > LinkDb: starting
> >> > LinkDb: linkdb: crawl.blog/linkdb
> >> > LinkDb: URL normalize: true
> >> > LinkDb: URL filter: true
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
> >> > LinkDb: adding segment:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
> >> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> >> > Input path does not exist:
> >> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
> >> >         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >> >         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >> >         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >> >         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> >> >         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
> >> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
> >> >
> >> >
> >> > --
> >> > http://www.linkedin.com/in/paultomblin
> >> >
> >> >
> >
> >
> >
> > --
> > http://www.linkedin.com/in/paultomblin
> >

--
http://www.linkedin.com/in/paultomblin
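For anyone hitting the same InvalidInputException: the LinkDb job reads parse_data from every directory under segments/, so a single segment left over from a crawl that died before the parse phase is enough to kill the whole run. A minimal sketch of checking for such segments before deleting anything (the crawl root here is a temp directory standing in for ~/nutch-1.0/crawl.blog, and the two segment names are taken from the log above purely for illustration):

```shell
# Simulate the crawl layout from the log: one segment that never got
# parsed, one that completed. In real use, point CRAWL at crawl.blog.
CRAWL="$(mktemp -d)/crawl.blog"
mkdir -p "$CRAWL/segments/20090723154530/crawl_fetch"  # died before parse: no parse_data
mkdir -p "$CRAWL/segments/20090724010303/parse_data"   # completed segment

# Any segment directory lacking a parse_data subdirectory is one the
# LinkDb job will trip over with InvalidInputException.
for seg in "$CRAWL"/segments/*/; do
    if [ ! -d "${seg}parse_data" ]; then
        echo "incomplete segment: ${seg}"
    fi
done
```

Deleting just the incomplete segment directories (or, as Xiao suggested, the whole crawl.blog directory) and recrawling is the usual way out; the loop only makes the failed segments visible before anything is removed.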
