Re: Crawl dies unexpectedly

Susam Pal Mon, 31 Mar 2008 10:14:17 -0700

Hi,

You seem to be using the latest revision from trunk. Recrawling was
introduced in the commit for revision #637122, so you can now crawl
using the same 'crawl' directory more than once. If you run the first
crawl with -depth M and a second crawl with -depth N, you'll end up
with M + N segments.


My guess is that you stopped the first crawl before it completed, and
the segment it left incomplete caused the error. If my guess is right,
you would probably get the same error again from that segment. If that
happens, you might have to delete the segment to proceed with the
index generation.
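
To spot such a segment before deleting anything, a check along these
lines might help. It is only a minimal sketch in Python, not part of
Nutch: the function name find_incomplete_segments is made up for
illustration, and it assumes that a complete segment contains the six
subdirectories shown in the listing of segment 20080331111720 quoted
below. It only reports; it does not delete anything.

```python
from pathlib import Path

# Subdirectories seen in the complete segment listed in the quoted message.
# Treating all six as required is an assumption of this sketch.
EXPECTED_PARTS = (
    "content",
    "crawl_fetch",
    "crawl_generate",
    "crawl_parse",
    "parse_data",
    "parse_text",
)


def find_incomplete_segments(segments_dir):
    """Return {segment_name: [missing subdirectories]} for incomplete segments."""
    incomplete = {}
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segment directories
        missing = [
            part for part in EXPECTED_PARTS if not (segment / part).is_dir()
        ]
        if missing:
            incomplete[segment.name] = missing
    return incomplete
```

Calling find_incomplete_segments("crawl/segments") from the directory
where the crawl was run should list any segment that, like
20080331111741 in the quoted report, has only crawl_generate.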

Regards,
Susam Pal

On Mon, Mar 31, 2008 at 7:14 PM, matt davies <[EMAIL PROTECTED]> wrote:
> Hi Dennis
>
>
>  "If you have a crawl depth of 3 then there should be only 3 segments/*
>  folder"
>
>  Thanks for that titbit, that makes a bit more sense now.
>
>  I have no idea where the other ones are coming from.
>
>  One of the sites I'm scanning is quite large, more than 10,000 pages;
>  in total we're talking about roughly 20,000 pages.
>
>  What would you recommend setting the crawl depth to Dennis?
>
>  I've tried rerunning the crawl after deleting the entire folder that
>  it was jamming on, it seems to be crawling again.
>
>  See what happens this time
>
>  Thanks for getting back to me Dennis.
>
>
>
>
>
>  On 31 Mar 2008, at 14:37, Dennis Kubes wrote:
>
>  > If you have a crawl depth of 3 then there should be only 3
>  > segments/* folders.  Any idea where the others came from?
>  >
>  > Dennis
>  >
>  > matt davies wrote:
>  >> Hello everyone
>  >> I've just added 12 URLs to my urls/filename file, added the same
>  >> URLs to my crawl-urlfilter.txt file, and ran the crawl like so:
>  >> bin/nutch crawl urls -dir crawl -depth 3
>  >> The crawl runs fine, it starts grabbing the URLs and creating the
>  >> segments, but then all of a sudden it dies with the following error
>  >> when trying to merge the segments.
>  >> CrawlDb update: segments: [crawl/segments/20080331113907]
>  >> CrawlDb update: additions allowed: true
>  >> CrawlDb update: URL normalizing: true
>  >> CrawlDb update: URL filtering: true
>  >> CrawlDb update: Merging segment data into db.
>  >> CrawlDb update: done
>  >> LinkDb: starting
>  >> LinkDb: linkdb: crawl/linkdb
>  >> LinkDb: URL normalize: true
>  >> LinkDb: URL filter: true
>  >> LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331112151
>  >> LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111831
>  >> LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111720
>  >> LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331111741
>  >> LinkDb: adding segment: file:/home/nutch/nutch/trunk/crawl/segments/20080331113907
>  >> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : file:/home/nutch/nutch/trunk/crawl/segments/20080331111741/parse_data
>  >>    at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:154)
>  >>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:537)
>  >>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
>  >>    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>  >>    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
>  >>    at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)
>  >> I checked one of the other segments, 20080331111720, and this
>  >> contained the following data
>  >> drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 content
>  >> drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 crawl_fetch
>  >> drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate
>  >> drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_parse
>  >> drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_data
>  >> drwxr-xr-x 3 nutch nutch 4096 2008-03-31 11:17 parse_text
>  >> But the segment with the problem does not contain all that data,
>  >> only
>  >> drwxr-xr-x 2 nutch nutch 4096 2008-03-31 11:17 crawl_generate
>  >> Has anyone got any ideas what could be going wrong here?  I've
>  >> checked space issues (loads of gigs free), and permissions on the
>  >> folders are identical.
>  >> Here's my nutch svn details
>  >> [EMAIL PROTECTED]:~/nutch/trunk$ svn info
>  >> Path: .
>  >> URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
>  >> Repository Root: http://svn.apache.org/repos/asf
>  >> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
>  >> Revision: 641752
>  >> Node Kind: directory
>  >> Schedule: normal
>  >> Last Changed Author: ab
>  >> Last Changed Rev: 638782
>  >> Last Changed Date: 2008-03-19 10:45:55 +0000 (Wed, 19 Mar 2008)
>  >> Any help greatly appreciated.
>
>
