Hi all,

I have two crawls/indexes which have been generated like so:

./bin/nutch crawl /home/rick//nutch/index_a -dir /home/rick/nutch/index_a/crawl -depth 1

./bin/nutch crawl /home/rick//nutch/index_b -dir /home/rick/nutch/index_b/crawl -depth 1

These appear to work fine.

I then want to merge them, creating a union of index_a and index_b called index_c.

So according to:

http://ajaxtrend.wordpress.com/2007/11/29/how-to-merge-nutch-indexes-v-09/

I need to merge the linkdbs like so:

./bin/nutch mergelinkdb /home/rick/nutch/index_c/crawl/linkdb /home/rick/nutch/index_a/crawl/linkdb /home/rick/nutch/index_b/crawl/linkdb

Then I merge the segments:

./bin/nutch mergesegs /home/rick/nutch/index_c/crawl/segments /home/rick/nutch/index_a/crawl/segments/* /home/rick/nutch/index_b/crawl/segments/*

All appears fine until I try to run invertlinks on the new index_c with the command:

./bin/nutch invertlinks /home/rick/nutch/index_c/crawl/linkdb -dir /home/rick/nutch/index_c/crawl/segments/

Which generates the following error:

LinkDb: starting
LinkDb: linkdb: /home/rick/nutch/index_c/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:302)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:258)

After this error, running the final stage with the command:

./bin/nutch index /home/rick/nutch/index_c/crawl/indexes /home/rick/nutch/index_c/crawl/linkdb/ /home/rick/nutch/index_c/crawl/crawldb/ /home/rick/nutch/index_c/crawl/segments/*

generates the following error:

Indexer: starting
Indexer: linkdb: /home/rick/nutch/index_c/crawl/crawldb
Indexer: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/crawl_parse
Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_text
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)
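
Both errors complain about parse directories missing from the merged segment. To confirm what is actually absent, a check along the following lines could be used. This is only a rough sketch: the function name missing_segment_parts is made up for illustration, and it assumes nothing beyond the three subdirectory names quoted in the errors above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): for every segment under the given
# segments directory, print the parse-related subdirectories that are absent.
missing_segment_parts() {
    segments_dir=$1
    for segment in "$segments_dir"/*/; do
        # Skip the unexpanded glob when the directory holds no segments.
        [ -d "$segment" ] || continue
        # Subdirectory names taken from the error messages above.
        for part in crawl_parse parse_data parse_text; do
            if [ ! -d "${segment}${part}" ]; then
                echo "missing: ${segment}${part}"
            fi
        done
    done
}

# Example call against the merged segments directory:
# missing_segment_parts /home/rick/nutch/index_c/crawl/segments
```

Running the same check against the index_a and index_b segments would show whether the parse data was already missing before the merge or was lost during it.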

I'm not sure what I'm doing wrong here. Does anyone know why this is failing and how to resolve it?

Any help appreciated.

Thanks again,


--
Rick Moynihan
Software Engineer
Calico Jack LTD
http://calicojack.co.uk/
