Hi, I ran the crawl with the script from http://wiki.apache.org/nutch/Crawl
This time I added a break (exit) just before invertlinks is executed, and ran the remaining Nutch commands manually *without* errors. So I suspect that something goes wrong when the merged segments are moved into segments:

  mv $MVARGS $crawldir/MERGEDsegments $crawldir/segments

I'm running the crawl in a virtual machine (Debian) and start the crawl script with nohup:

  nohup runbot.sh > foo.out 2> foo.err < /dev/null &

Is it possible that invertlinks is executed before the mv command has finished? Or are there known problems with running runbot.sh under nohup? (I have put two sketches of what I mean after the quoted log below.)

Thanks
Pato

----- Original Message -----
From: kevin chen <kevinc...@bdsing.com>
To: nutch-user@lucene.apache.org
Sent: Friday, 19 March 2010, 3:41:40 AM
Subject: Re: invertlinks: Input path does not exist

Sounds like the last segment is corrupted. Did you try to remove the last segment?

On Wed, 2010-03-17 at 16:10 +0000, Patricio Galeas wrote:
> Hello all,
>
> I'm crawling the web with the LanguageIdentifier plugin, but I get an
> error when running nutch invertlinks. The error always occurs while
> processing the last segment (20100317010313-81).
>
> The problem is the same as the one described in
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg14776.html
> With both syntax variants of invertlinks I get the same error:
> a) nutch invertlinks crawl/linkdb -dir crawl/segments
> b) nutch invertlinks crawl/linkdb crawl/segments/*
>
> I applied https://issues.apache.org/jira/browse/NUTCH-356 to avoid some
> Java heap problems when using the Language Identifier, but I got the
> same error. ;-(
>
> I set NUTCH_HEAPSIZE to 6000 (the physical memory) and I merged the
> segments using slice=50000.
>
> Any idea where to look?
>
> Thanks
> Pato
>
> --------------------hadoop.log----------------------------------
> ..
> 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-47
> 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-68
> 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-56
> 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-12
> 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-26
> 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-73
> 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-59
> 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-30
> 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-2
> 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-34
> 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-52
> 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-29
> 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-24
> 2010-03-17 02:33:25,610 FATAL crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-81/parse_data
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
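PS, on my own question above: as far as I understand it, a shell script runs its commands sequentially, so invertlinks should not start before mv has returned, and nohup does not change that; only a command backgrounded with & could overlap. What could happen is that the mv fails (for example, disk full in the VM) and the script simply carries on. Here is a minimal sketch of how the relevant lines could fail fast instead ($MVARGS and $crawldir as in the wiki script; the set -e wrapper is my own assumption, not part of runbot.sh, and the nutch path may differ in your install):

  #!/bin/bash
  set -e   # abort the whole script as soon as any command fails

  # Same move as in the wiki script; with set -e a failed move now
  # stops the run instead of letting invertlinks see a half-populated
  # segments directory.
  mv $MVARGS $crawldir/MERGEDsegments $crawldir/segments

  bin/nutch invertlinks $crawldir/linkdb -dir $crawldir/segments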
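And here is a sketch of the check I would add right before invertlinks, which would also cover Kevin's suggestion of removing the corrupted last segment (the BADsegments quarantine directory is hypothetical, not part of runbot.sh):

  #!/bin/bash
  crawldir=crawl_al   # adjust to the actual crawl directory

  # Hypothetical pre-check: move any segment without parse_data aside,
  # so that invertlinks only ever sees complete segments.
  mkdir -p "$crawldir/BADsegments"
  for seg in "$crawldir"/segments/*; do
    [ -d "$seg" ] || continue   # skip if the glob matched nothing
    if [ ! -d "$seg/parse_data" ]; then
      echo "incomplete segment: $seg (no parse_data), moving it aside" >&2
      mv "$seg" "$crawldir/BADsegments/"
    fi
  done

  bin/nutch invertlinks "$crawldir/linkdb" -dir "$crawldir/segments"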