Hi Rick, I'm not able to open the URL you posted, so I don't know exactly how you did this. Why don't you try this instead: http://wiki.apache.org/nutch/MergeCrawl ? It works fine on my system. Hope it's helpful to you.
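For reference, the procedure on that wiki page boils down to roughly the following steps. This is only a sketch (the directory names are placeholders based on your paths below, and you should check the wiki script for the exact flags on your Nutch version); the key differences from your sequence are that the crawldbs get merged too, and the linkdb is rebuilt from the merged segments with invertlinks rather than merged from the old linkdbs:

```shell
#!/bin/sh
# Sketch of the merge workflow from http://wiki.apache.org/nutch/MergeCrawl
# Paths below are placeholders modelled on your layout -- adapt as needed.
A=/home/rick/nutch/index_a/crawl
B=/home/rick/nutch/index_b/crawl
MERGED=/home/rick/nutch/index_c/crawl   # hypothetical merged-crawl directory

# 1. Merge the crawldbs (this step is missing from your sequence)
./bin/nutch mergedb "$MERGED/crawldb" "$A/crawldb" "$B/crawldb"

# 2. Merge all segments from both crawls into one new segment
./bin/nutch mergesegs "$MERGED/segments" "$A"/segments/* "$B"/segments/*

# 3. Rebuild the linkdb from the merged segments
#    (instead of merging the two old linkdbs)
./bin/nutch invertlinks "$MERGED/linkdb" -dir "$MERGED/segments"

# 4. Index the merged crawl
./bin/nutch index "$MERGED/indexes" "$MERGED/crawldb" "$MERGED/linkdb" \
    "$MERGED"/segments/*
```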
Rick Moynihan wrote:
>
> Hi all,
>
> I have two crawls/indexes which have been generated like so:
>
> ./bin/nutch crawl /home/rick//nutch/index_a -dir /home/rick/nutch/index_a/crawl -depth 1
>
> ./bin/nutch crawl /home/rick//nutch/index_b -dir /home/rick/nutch/index_b/crawl -depth 1
>
> These appear to work fine.
>
> I then wish to merge them, creating a union of index_a and index_b called index_c.
>
> So according to:
>
> http://ajaxtrend.wordpress.com/2007/11/29/how-to-merge-nutch-indexes-v-09/
>
> I need to merge the linkdbs like so:
>
> ./bin/nutch mergelinkdb /home/rick/nutch/index_c/crawl/linkdb /home/rick/nutch/index_a/crawl/linkdb /home/rick/nutch/index_b/crawl/linkdb
>
> Then I merge the segments:
>
> ./bin/nutch mergesegs /home/rick/nutch/index_c/crawl/segments /home/rick/nutch/index_a/crawl/segments/* /home/rick/nutch/index_b/crawl/segments/*
>
> All appears fine until I try to invertlinks on the new index_c with the command:
>
> ./bin/nutch invertlinks /home/rick/nutch/index_c/crawl/linkdb -dir /home/rick/nutch/index_c/crawl/segments/
>
> Which generates the following error:
>
> LinkDb: starting
> LinkDb: linkdb: /home/rick/nutch/index_c/crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
>         at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)
>         at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:302)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:258)
>
> After this error, attempting the final stage of running the command:
>
> ./bin/nutch index /home/rick/nutch/index_c/crawl/indexes /home/rick/nutch/index_c/crawl/linkdb/ /home/rick/nutch/index_c/crawl/crawldb/ /home/rick/nutch/index_c/crawl/segments/*
>
> generates the following error:
>
> Indexer: starting
> Indexer: linkdb: /home/rick/nutch/index_c/crawl/crawldb
> Indexer: adding segment: /home/rick/nutch/index_c/crawl/segments/20080122182637
> Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/crawl_parse
> Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_data
> Input path doesnt exist : /home/rick/nutch/index_c/crawl/segments/20080122182637/parse_text
>         at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:307)
>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:329)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:312)
>
> I'm not sure what I'm doing wrong here. Does anyone know why this is failing and how to resolve it?
>
> Any help appreciated.
>
> Thanks again,
>
> --
> Rick Moynihan
> Software Engineer
> Calico Jack LTD
> http://calicojack.co.uk/

--
View this message in context: http://www.nabble.com/Problem-merging-two-indexes--nutch-0.9-dev--%28Input-path-doesnt-exist%29-tp15026326p15048936.html
Sent from the Nutch - User mailing list archive at Nabble.com.
