Hi,

I ran the crawl with the script from http://wiki.apache.org/nutch/Crawl

This time I added a break (exit) before invertlinks is executed and ran the 
remaining nutch commands manually, *without* errors.

I suspect that something goes wrong when moving MERGEDsegments to 
segments:

mv $MVARGS $crawldir/MERGEDsegments $crawldir/segments
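
Since "mv" can fail part-way (for example on a full disk, or when 
MERGEDsegments and segments are on different filesystems and the move turns 
into a copy plus delete), I was thinking of a guard like the following in 
runbot.sh. This is just a sketch, using the $MVARGS and $crawldir variables 
from the wiki script:

# Sketch: abort before invertlinks if the move did not complete cleanly
if ! mv $MVARGS "$crawldir/MERGEDsegments" "$crawldir/segments"; then
  echo "moving MERGEDsegments failed, aborting before invertlinks" >&2
  exit 1
fi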

I'm running the crawl in a virtual machine (Debian) and execute the crawl 
script using "nohup":
nohup runbot.sh > foo.out 2> foo.err < /dev/null &

Is it possible that invertlinks is executed before the "mv" command has 
finished?
Or are there known problems with running runbot.sh under "nohup"?
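
In case it helps to narrow this down, a quick check like this (only a 
sketch, plain shell, $crawldir as in the script) could report segments that 
are missing parse_data before invertlinks is started:

# Sketch: list segments without a parse_data directory
for seg in "$crawldir"/segments/*; do
  [ -d "$seg/parse_data" ] || echo "no parse_data in $seg" >&2
done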

Thanks
Pato


----- Original Message ----
From: kevin chen <kevinc...@bdsing.com>
To: nutch-user@lucene.apache.org
Sent: Friday, March 19, 2010, 3:41:40 AM
Subject: Re: invertlinks: Input path does not exist

Sounds like the last segment is corrupted.
Did you try to remove the last segment?
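
For example, something along these lines (only a sketch; the segment name 
is taken from your log below, adjust the paths to your crawl directory):

# Sketch: move the suspect segment out of the way, then retry
mv crawl/segments/20100317010313-81 /tmp/
nutch invertlinks crawl/linkdb -dir crawl/segments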

On Wed, 2010-03-17 at 16:10 +0000, Patricio Galeas wrote:
>     Hello all,
> 
> I am crawling the web using the LanguageIdentifier plugin, but I get an
> error when running nutch invertlinks.
> The error always occurs while processing the last segment
> (20100317010313-81).
> 
> The problem is the same as described in
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg14776.html
> With both syntax variants of invertlinks I get the same error:
> a) nutch invertlinks crawl/linkdb -dir crawl/segments
> b) nutch invertlinks crawl/linkdb crawl/segments/*
> 
> I applied https://issues.apache.org/jira/browse/NUTCH-356 to avoid some Java
> heap problems when using the LanguageIdentifier, but I got the same error. ;-(
> 
> I set NUTCH_HEAPSIZE to 6000 (the physical memory) and merged the segments
> using slice=50000.
> 
> Any idea where to look?
> 
> Thanks
> Pato
> 
> --------------------hadoop.log----------------------------------
> ..
> 2010-03-17 02:33:25,107 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-47
> 2010-03-17 02:33:25,107 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-68
> 2010-03-17 02:33:25,108 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-56
> 2010-03-17 02:33:25,108 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-12
> 2010-03-17 02:33:25,108 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-26
> 2010-03-17 02:33:25,109 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-73
> 2010-03-17 02:33:25,109 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-59
> 2010-03-17 02:33:25,109 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-30
> 2010-03-17 02:33:25,110 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-2
> 2010-03-17 02:33:25,110 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-34
> 2010-03-17 02:33:25,111 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-52
> 2010-03-17 02:33:25,111 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-29
> 2010-03-17 02:33:25,111 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-24
> 2010-03-17 02:33:25,610 FATAL crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-81/parse_data
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>         at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
> 