Hi all,
I have two crawls/indexes which have been generated like so:
./bin/nutch crawl /home/rick//nutch/index_a -dir /home/rick/nutch/index_a/crawl -depth 1
./bin/nutch crawl /home/rick//nutch/index_b -dir /home/rick/nutch/index_b/crawl -depth 1
These appear to work fine.
I then want to merge them, creating a union of index_a and index_b called
index_c.
So according to:
http://ajaxtrend.wordpress.com/2007/11/29/how-to-merge-nutch-indexes-v-09/
I need to merge the linkdbs like so:
./bin/nutch mergelinkdb /home/rick/nutch/index_c/crawl/linkdb /home/rick/nutch/index_a/crawl/linkdb /home/rick/nutch/index_b/crawl/linkdb
Then I merge the segments:
./bin/nutch mergesegs /home/rick/nutch/index_c/crawl/segments /home/rick/nutch/index_a/crawl/segments/* /home/rick/nutch/index_b/crawl/segments/*
All appears fine until I try to run invertlinks on the new index_c with the
command:
./bin/nutch invertlinks /home/rick/nutch/index_c/crawl/linkdb -dir /home/rick/nutch/index_c/crawl/segments/
Which generates the following error:
LinkDb: starting
LinkDb: linkdb: /home/rick/nutch/index_c/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:302)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:258)
After this error, attempting the final stage by running the command:
./bin/nutch index /home/rick/nutch/index_c/crawl/indexes /home/rick/nutch/index_c/crawl/linkdb/ /home/rick/nutch/index_c/crawl/crawldb/ /home/rick/nutch/index_c/crawl/segments/*
generates the following error:
Indexer: starting
Indexer: linkdb: /home/rick/nutch/index_c/crawl/crawldb
Indexer: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/crawl_parse
Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_text
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)
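To see exactly which parts of a segment are absent, a small check along these lines could be run against each segment directory. This is only a sketch: check_segment is a helper name made up for illustration (it is not part of Nutch), and the list of six subdirectories is an assumption about what a complete segment contains.

```shell
#!/bin/sh
# Report which of the expected subdirectories a segment is missing.
# check_segment is a hypothetical helper, not a Nutch command.
# The six names below are assumed to make up a complete segment.
check_segment() {
    seg=$1
    missing=""
    for part in content crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
        # Each part is expected to be a directory inside the segment.
        if [ ! -d "$seg/$part" ]; then
            missing="$missing $part"
        fi
    done
    if [ -n "$missing" ]; then
        echo "$seg: missing$missing"
        return 1
    fi
    echo "$seg: complete"
}
```

It could be called in a loop, for example: for seg in /home/rick/nutch/index_c/crawl/segments/*; do check_segment "$seg"; done. Running the same loop over the index_a and index_b segments would show whether the parts were already missing before the merge.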
I'm not sure what I'm doing wrong here. Does anyone know why this is
failing and how to resolve it?
Any help appreciated.
Thanks again,
--
Rick Moynihan
Software Engineer
Calico Jack LTD
http://calicojack.co.uk/