Hello all, I'm crawling the web using the LanguageIdentifier plugin, but I get an error when running nutch invertlinks. The error always occurs while processing the last segment (20100317010313-81).
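(For anyone reproducing this: a quick shell sketch to list segments that have no parse_data directory, which is what the stack trace below complains about. The segments path is taken from my log; adjust it to your own crawl directory.)

```shell
# List every segment directory that is missing parse_data.
# SEGMENTS path assumed from the hadoop.log below -- adjust as needed.
SEGMENTS=/mnt/nutch-1.0/crawl_al/segments
for seg in "$SEGMENTS"/*; do
  [ -d "$seg/parse_data" ] || echo "missing parse_data: $seg"
done
```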
The problem is the same as described in http://www.mail-archive.com/nutch-user@lucene.apache.org/msg14776.html. I get the same error with both syntax variants of invertlinks:

a) nutch invertlinks crawl/linkdb -dir crawl/segments
b) nutch invertlinks crawl/linkdb crawl/segments/*

I applied https://issues.apache.org/jira/browse/NUTCH-356 to avoid some Java heap problems when using the LanguageIdentifier, but I still get the same error. ;-( I set NUTCH_HEAPSIZE to 6000 (the physical memory) and merged the segments using slice=50000.

Any idea where to look?

Thanks
Pato

--------------------hadoop.log----------------------------------
..
2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-47
2010-03-17 02:33:25,107 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-68
2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-56
2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-12
2010-03-17 02:33:25,108 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-26
2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-73
2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-59
2010-03-17 02:33:25,109 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-30
2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-2
2010-03-17 02:33:25,110 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-34
2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-52
2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-29
2010-03-17 02:33:25,111 INFO crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-24
2010-03-17 02:33:25,610 FATAL crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-81/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
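(A workaround I'm considering, not confirmed by the earlier thread: park the segment that has no parse_data so invertlinks only sees complete segments, then re-parse it separately. Segment name and paths are taken from my log; adjust to your setup.)

```shell
# Sketch of a workaround (assumed, not a confirmed fix): move the
# incomplete segment aside so invertlinks skips it.
SEGMENTS=crawl_al/segments          # adjust to your crawl directory
BROKEN=20100317010313-81            # segment name from the log above
mkdir -p "${SEGMENTS}_incomplete"
if [ -d "$SEGMENTS/$BROKEN" ] && [ ! -d "$SEGMENTS/$BROKEN/parse_data" ]; then
  mv "$SEGMENTS/$BROKEN" "${SEGMENTS}_incomplete/"
fi
```

The parked segment could then be re-parsed on its own (I believe "nutch parse" takes a segment directory, but please correct me if that's wrong for 1.0) and moved back before rerunning invertlinks.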