Hello all,

I am crawling the web using the
LanguageIdentifier plugin, but I get an error when running nutch
invertlinks.
The error always occurs while processing
the last segment (20100317010313-81).

The problem is the same as the one described in
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg14776.html
With both invocations of
invertlinks I get the same error:
a) nutch invertlinks crawl/linkdb -dir crawl/segments
b) nutch invertlinks crawl/linkdb crawl/segments/*

I applied
https://issues.apache.org/jira/browse/NUTCH-356 to avoid some Java
heap problems when using the LanguageIdentifier, but I still get the same error. ;-(

I set NUTCH_HEAPSIZE to 6000
(the physical memory), and I merged the segments using slice=50000.

Any idea where to look?
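In case it helps, here is a small check I can run to see which segments are missing the parse_data directory that invertlinks complains about (just a sketch; the segments path is an example and should point at the actual crawl):

```shell
# find_unparsed: print every segment directory under $1 that has no
# parse_data subdirectory, i.e. segments LinkDb would reject.
find_unparsed() {
    for seg in "$1"/*/; do
        [ -d "${seg}parse_data" ] || echo "missing parse_data: ${seg%/}"
    done
}

# e.g. find_unparsed crawl/segments
```

Any segment it reports could then be re-parsed or moved aside before rebuilding the linkdb.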

Thanks
Pato

--------------------hadoop.log----------------------------------
..
2010-03-17 02:33:25,107 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-47
2010-03-17 02:33:25,107 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-68
2010-03-17 02:33:25,108 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-56
2010-03-17 02:33:25,108 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-12
2010-03-17 02:33:25,108 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-26
2010-03-17 02:33:25,109 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-73
2010-03-17 02:33:25,109 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-59
2010-03-17 02:33:25,109 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-30
2010-03-17 02:33:25,110 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-2
2010-03-17 02:33:25,110 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-34
2010-03-17 02:33:25,111 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-52
2010-03-17 02:33:25,111 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-29
2010-03-17 02:33:25,111 INFO  crawl.LinkDb - LinkDb: adding segment: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-24
2010-03-17 02:33:25,610 FATAL crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/mnt/nutch-1.0/crawl_al/segments/20100317010313-81/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
