Problem merging two indexes [nutch-0.9-dev] (Input path doesnt exist)

Rick Moynihan Tue, 22 Jan 2008 11:27:12 -0800

Hi all,

I have two crawls/indexes which have been generated like so:

./bin/nutch crawl /home/rick//nutch/index_a -dir /home/rick/nutch/index_a/crawl -depth 1

./bin/nutch crawl /home/rick//nutch/index_b -dir /home/rick/nutch/index_b/crawl -depth 1


These appear to work fine.

I then wish to merge them, creating a union of index_a and index_b called index_c.


So according to:

http://ajaxtrend.wordpress.com/2007/11/29/how-to-merge-nutch-indexes-v-09/

I need to merge the linkdbs like so:

./bin/nutch mergelinkdb /home/rick/nutch/index_c/crawl/linkdb /home/rick/nutch/index_a/crawl/linkdb /home/rick/nutch/index_b/crawl/linkdb


Then I merge the segments:

./bin/nutch mergesegs /home/rick/nutch/index_c/crawl/segments /home/rick/nutch/index_a/crawl/segments/* /home/rick/nutch/index_b/crawl/segments/*

All appears fine until I try to run invertlinks on the new index_c with the command:

./bin/nutch invertlinks /home/rick/nutch/index_c/crawl/linkdb -dir /home/rick/nutch/index_c/crawl/segments/


Which generates the following error:

LinkDb: starting
LinkDb: linkdb: /home/rick/nutch/index_c/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true

LinkDb: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:302)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:258)

After this error, attempting the final stage of running the command:

./bin/nutch index /home/rick/nutch/index_c/crawl/indexes /home/rick/nutch/index_c/crawl/linkdb/ /home/rick/nutch/index_c/crawl/crawldb/ /home/rick/nutch/index_c/crawl/segments/*


generates the following error:

Indexer: starting
Indexer: linkdb: /home/rick/nutch/index_c/crawl/crawldb

Indexer: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/crawl_parse
Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_text
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)
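
Both errors complain about missing subdirectories inside the merged segment, so a quick check of which parts each segment actually contains may help narrow this down. The following is only a rough sketch: the list of part names is an assumption about what a complete Nutch segment looks like (crawl_parse, parse_data and parse_text are the ones named in the errors above), and check_segment is a made-up helper name, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper: report which expected parts are missing from one segment.
# ASSUMPTION: a complete segment holds the six subdirectories listed below;
# crawl_parse, parse_data and parse_text are taken from the error messages.
check_segment() {
    seg="$1"
    missing=""
    for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
        if [ ! -d "$seg/$part" ]; then
            missing="$missing $part"
        fi
    done
    if [ -z "$missing" ]; then
        echo "$seg: complete"
    else
        echo "$seg: missing$missing"
    fi
}

# Check every segment found under each segments directory given as an argument.
for dir in "$@"; do
    for seg in "$dir"/*; do
        if [ -d "$seg" ]; then
            check_segment "$seg"
        fi
    done
done
```

Saved under a name such as check_segments.sh (also hypothetical), it could be run against both the source and the merged segments directories, for example: sh check_segments.sh /home/rick/nutch/index_a/crawl/segments /home/rick/nutch/index_c/crawl/segments. Comparing the two outputs should show whether the parse data was already absent before the merge or only went missing afterwards.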

I'm not sure what I'm doing wrong here. Does anyone know why this is failing and how to resolve it?


Any help appreciated.

Thanks again,


--
Rick Moynihan
Software Engineer
Calico Jack LTD
http://calicojack.co.uk/
