Re: Crawl dies unexpectedly

Dennis Kubes Mon, 31 Mar 2008 06:38:03 -0700

If you have a crawl depth of 3 then there should be only 3 segments/* folders. Any idea where the others came from?
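
A quick way to see how many segment folders exist and which of them are incomplete is a small script along these lines. This is only a rough sketch in Python, not anything shipped with Nutch: the function name find_incomplete_segments is made up for illustration, and the list of expected subdirectories is an assumption taken from the listing of the complete segment in the quoted message.

```python
import os

# Assumption: a fully fetched and parsed segment has the six
# subdirectories shown for segment 20080331111720 in this thread.
EXPECTED_PARTS = (
    "content",
    "crawl_fetch",
    "crawl_generate",
    "crawl_parse",
    "parse_data",
    "parse_text",
)


def find_incomplete_segments(segments_dir):
    """Map each incomplete segment name to the subdirectories it lacks.

    Hypothetical helper: it only inspects the directory layout and does
    not call into Nutch or Hadoop.
    """
    incomplete = {}
    for name in sorted(os.listdir(segments_dir)):
        seg_path = os.path.join(segments_dir, name)
        if not os.path.isdir(seg_path):
            continue  # ignore stray files next to the segment folders
        missing = [
            part
            for part in EXPECTED_PARTS
            if not os.path.isdir(os.path.join(seg_path, part))
        ]
        if missing:
            incomplete[name] = missing
    return incomplete
```

Called as find_incomplete_segments("crawl/segments"), it should flag any segment that, like 20080331111741 in the report, holds only crawl_generate. Segments left over from an earlier, interrupted run would likely show up the same way.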


Dennis


matt davies wrote:

Hello everyone
I've just added 12 URLs to my urls/filename file, added the same URLs to my crawl-urlfilter.txt file, and ran the crawl like so:
bin/nutch crawl urls -dir crawl -depth 3
The crawl runs fine: it starts grabbing the URLs and creating the segments, but then all of a sudden it dies with the following error when trying to merge the segments.
CrawlDb update: segments: [crawl/segments/20080331113907]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331112151
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111831
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111720
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111741
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331113907
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist: file:/home/nutch/nutch/trunk/crawl/segments/20080331111741/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:537)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
I checked one of the other segments, 20080331111720, and it contained the following data:
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 content
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 crawl_fetch
drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate
drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_parse
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_data
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_text

But the problem segment, 20080331111741, does not contain all that data, only:

drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate
Has anyone got any ideas what could be going wrong here? I've checked for space issues (loads of gigs free), and the permissions on the folders are identical.
Here are my Nutch svn details:

[EMAIL PROTECTED]:~/nutch/trunk$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 641752
Node Kind: directory
Schedule: normal
Last Changed Author: ab
Last Changed Rev: 638782
Last Changed Date: 2008-03-19 10:45:55 +0000 (Wed, 19 Mar 2008)

Any help greatly appreciated.
