I had similar problems caused by a lack of space in the temp directory. To solve it, I edited hadoop-site.xml and set hadoop.tmp.dir to a directory with plenty of space.
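
For reference, the property block looked roughly like this (a minimal sketch; /mnt/bigdisk/hadoop-tmp is just an example path, point it at any volume with enough free space):

  <!-- hadoop-site.xml: move Hadoop's temp/working files off the full volume.
       The value below is an example path, not a required location. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/bigdisk/hadoop-tmp</value>
  </property>

After changing it, rerun invertlinks so the job picks up the new directory.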
> -----Original Message-----
> From: kevin chen [mailto:kevinc...@bdsing.com]
> Sent: Friday, March 19, 2010 1:42 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: invertlinks: Input path does not exist
>
> Sounds like the last segment is corrupted.
> Did you try to remove the last segment?
>
> On Wed, 2010-03-17 at 16:10 +0000, Patricio Galeas wrote:
> > Hello all,
> >
> > I am crawling the web using the LanguageIdentifier plugin, but I get
> > an error when running nutch invertlinks. The error always occurs
> > while processing the last segment (20100317010313-81).
> >
> > The problem is the same as the one described in
> > http://www.mail-archive.com/nutch-user@lucene.apache.org/msg14776.html
> > With both syntax variants of invertlinks I get the same error:
> > a) nutch invertlinks crawl/linkdb -dir crawl/segments
> > b) nutch invertlinks crawl/linkdb crawl/segments/*
> >
> > I applied https://issues.apache.org/jira/browse/NUTCH-356 to avoid
> > some Java heap problems when using the LanguageIdentifier, but I got
> > the same error. ;-(
> >
> > I set NUTCH_HEAPSIZE to 6000 (the physical memory) and I merged the
> > segments using slice=50000.
> >
> > Any idea where to look?
> >
> > Thanks
> > Pato
> >
> > -------------------- hadoop.log ----------------------------------
> > ..
> > 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-47
> > 2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-68
> > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-56
> > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-12
> > 2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-26
> > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-73
> > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-59
> > 2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-30
> > 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-2
> > 2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-34
> > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-52
> > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-29
> > 2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-24
> > 2010-03-17 02:33:25,610 FATAL crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException:
> > Input path does not exist: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-81/parse_data
> >   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >   at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> >   at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
> >   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >   at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
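
To make kevin's suggestion above concrete: the FATAL line shows file:/ URIs, so the crawl appears to live on the local filesystem and the failing segment can be inspected directly. A rough sketch (paths are taken from the log above; the sub-directory names are the standard Nutch 1.0 segment layout):

  # A fully fetched and parsed segment contains these sub-directories:
  #   content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text
  ls /mnt/nutch-1.0/crawl_al/segments/20100317010313-81

  # If parse_data is missing, try re-parsing the segment ...
  bin/nutch parse /mnt/nutch-1.0/crawl_al/segments/20100317010313-81

  # ... or drop the corrupted segment before rerunning invertlinks:
  rm -r /mnt/nutch-1.0/crawl_al/segments/20100317010313-81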