Hi Paul,

Can you post the error messages from the log file
(file:/Users/ptomblin/nutch-1.0/logs)?
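If it helps, something like this should pull the most recent errors out of it (assuming the default hadoop.log file name, which is where Nutch 1.0 normally writes its log; adjust if yours differs):

    # show the last few errors/exceptions from the Nutch log
    grep -n -E 'ERROR|Exception' logs/hadoop.log | tail -n 40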
On Mon, Jul 27, 2009 at 6:55 PM, Paul Tomblin <[email protected]> wrote:
> Actually, I got that error the first time I used it, and then again when I
> blew away the downloaded nutch and grabbed the latest trunk from Subversion.
>
> On Mon, Jul 27, 2009 at 1:11 AM, xiao yang <[email protected]> wrote:
>
>> You must have crawled several times, and some of the crawls failed
>> before the parse phase, so the parse data was never generated.
>> You'd better delete the whole directory
>> file:/Users/ptomblin/nutch-1.0/crawl.blog and recrawl; then the crawl
>> output will show you the exact reason it failed in the parse phase.
>>
>> Xiao
>>
>> On Fri, Jul 24, 2009 at 10:53 PM, Paul Tomblin <[email protected]> wrote:
>> > I installed nutch 1.0 on my laptop last night and set it running to
>> > crawl my blog with the command: bin/nutch crawl urls -dir crawl.blog -depth 10
>> > It was still running strong when I went to bed several hours later,
>> > and this morning I woke up to this:
>> >
>> > activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>> > -activeThreads=0
>> > Fetcher: done
>> > CrawlDb update: starting
>> > CrawlDb update: db: crawl.blog/crawldb
>> > CrawlDb update: segments: [crawl.blog/segments/20090724010303]
>> > CrawlDb update: additions allowed: true
>> > CrawlDb update: URL normalizing: true
>> > CrawlDb update: URL filtering: true
>> > CrawlDb update: Merging segment data into db.
>> > CrawlDb update: done
>> > LinkDb: starting
>> > LinkDb: linkdb: crawl.blog/linkdb
>> > LinkDb: URL normalize: true
>> > LinkDb: URL filter: true
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155106
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155122
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155303
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723155812
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723161808
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723171215
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723193543
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723224936
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724004250
>> > LinkDb: adding segment: file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090724010303
>> > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
>> > Input path does not exist:
>> > file:/Users/ptomblin/nutch-1.0/crawl.blog/segments/20090723154530/parse_data
>> >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>> >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>> >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>> >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>> >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>> >     at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>> >
>> > --
>> > http://www.linkedin.com/in/paultomblin
>>
>
> --
> http://www.linkedin.com/in/paultomblin
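By the way, before you delete everything, you can check which segments are missing their parse output. A quick sketch, assuming the standard Nutch 1.0 segment layout (a fully parsed segment contains crawl_fetch, crawl_parse, parse_data and parse_text):

    # list segments that were fetched but never parsed
    for seg in crawl.blog/segments/*; do
        [ -d "$seg/parse_data" ] || echo "missing parse_data: $seg"
    done

Removing just those incomplete segments (or re-parsing them with 'bin/nutch parse <segment>', if that works in your build) might let the LinkDb step finish without redoing the whole crawl.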
