If you have a crawl depth of 3 then there should be only 3 segments/* folders. Any idea where the others came from?

Dennis

matt davies wrote:
Hello everyone

I've just added 12 URLs to my urls/filename file, added the same URLs to my crawl-urlfilter.txt file, and ran the crawl like so:

bin/nutch crawl urls -dir crawl -depth 3

The crawl runs fine at first: it starts grabbing the URLs and creating the segments, but then all of a sudden it dies with the following error when LinkDb tries to read the segments.

CrawlDb update: segments: [crawl/segments/20080331113907]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331112151
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111831
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111720
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111741
LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331113907
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/home/nutch/nutch/trunk/crawl/segments/20080331111741/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:537)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)

I checked one of the other segments, 20080331111720, and it contained the following:

drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 content
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 crawl_fetch
drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate
drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_parse
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_data
drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_text

But the segment named in the error, 20080331111741, does not contain all of that, only:

drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate

Has anyone got any ideas what could be going wrong here? I've checked for space issues (loads of gigs free), and the permissions on the folders are identical.
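
A quick way to spot every segment in this state is to compare each one against the six subdirectories listed above. The sketch below is a small Python helper, not part of Nutch: the function name find_incomplete_segments and the EXPECTED_PARTS tuple are made up for illustration, and the expected layout is assumed from the complete segment shown above.

```python
from pathlib import Path

# Subdirectories seen in the complete segment listed above (an assumption
# about what a fully fetched and parsed segment should contain).
EXPECTED_PARTS = (
    "content",
    "crawl_fetch",
    "crawl_generate",
    "crawl_parse",
    "parse_data",
    "parse_text",
)


def find_incomplete_segments(segments_dir):
    """Return {segment_name: [missing subdirectories]} for incomplete segments."""
    incomplete = {}
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segment folders
        missing = [part for part in EXPECTED_PARTS
                   if not (segment / part).is_dir()]
        if missing:
            incomplete[segment.name] = missing
    return incomplete
```

Calling find_incomplete_segments("crawl/segments") from the directory the crawl was run in should list any segment that, like the one above, only got as far as crawl_generate.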

Here are my Nutch svn details:

[EMAIL PROTECTED]:~/nutch/trunk$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 641752
Node Kind: directory
Schedule: normal
Last Changed Author: ab
Last Changed Rev: 638782
Last Changed Date: 2008-03-19 10:45:55 +0000 (Wed, 19 Mar 2008)

Any help greatly appreciated.


