I have generated a number of indexes for individual sites using Nutch and was 
looking for a way to merge them all into one index to be used in a live 
system. I struggled with this for a while and finally found a way. Here are 
the steps.
  Let's say you have two working crawls, i.e. crawl1 and crawl2. I am assuming 
that each of them has the following directories, generated by the 
bin/nutch crawl command:
  crawldb
  index
  indexes
  linkdb
  segments
  Assume you are in the parent directory that contains the folders crawl1 and 
crawl2.
  You need to merge the individual dbs, i.e. linkdb, crawldb and segments. Then 
you need to generate the index.
  Please create a directory called mergeaall. This directory will contain the 
merged linkdb, crawldb and segments.
  - Merge linkdbs
  bin/nutch mergelinkdb mergeaall/linkdb crawl1/linkdb crawl2/linkdb
  - Merge crawldbs
  bin/nutch mergedb mergeaall/crawldb crawl1/crawldb crawl2/crawldb
  - Merge segments
  bin/nutch mergesegs mergeaall/segments crawl1/segments/* crawl2/segments/*
  - Invertlinks
  bin/nutch invertlinks mergeaall/linkdb/ mergeaall/segments/*
  Now run the index command to create the Nutch index. Note the argument 
order: output index, then crawldb, then linkdb, then the segments.
  bin/nutch index mergeaall/indexes mergeaall/crawldb mergeaall/linkdb 
mergeaall/segments/*
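
  For convenience, the steps above could be wrapped in a small shell script. 
This is only a sketch, not something I have run against every Nutch version: 
the function names merge_crawls and run, and the NUTCH and DRY_RUN variables, 
are made up for this example. It assumes the Nutch mergelinkdb, mergedb, 
mergesegs, invertlinks and index commands, and paths without spaces.

```shell
#!/bin/sh
# Sketch: merge several Nutch crawl directories into one and build an index.
# NUTCH and DRY_RUN are hypothetical knobs introduced for this example.
NUTCH="${NUTCH:-bin/nutch}"

# Print a command, then execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@" || exit 1
    fi
}

# Usage: merge_crawls OUTPUT_DIR CRAWL_DIR [CRAWL_DIR ...]
# Assumes every CRAWL_DIR contains linkdb, crawldb and segments.
merge_crawls() {
    out=$1
    shift
    linkdbs=""
    crawldbs=""
    segs=""
    for c in "$@"; do
        linkdbs="$linkdbs $c/linkdb"
        crawldbs="$crawldbs $c/crawldb"
        segs="$segs $c/segments/*"
    done
    # The list variables are left unquoted on purpose so that they split
    # into separate arguments and the segment globs expand.
    run $NUTCH mergelinkdb "$out/linkdb" $linkdbs
    run $NUTCH mergedb "$out/crawldb" $crawldbs
    run $NUTCH mergesegs "$out/segments" $segs
    # The merged segments exist only after mergesegs has finished.
    run $NUTCH invertlinks "$out/linkdb" "$out"/segments/*
    run $NUTCH index "$out/indexes" "$out/crawldb" "$out/linkdb" "$out"/segments/*
}
```

  Calling "merge_crawls mergeaall crawl1 crawl2" from the parent directory 
should run the same five commands as listed above. Setting DRY_RUN=1 first 
only prints them, which is a cheap way to check the paths before touching 
any data.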
   
  If there is a more direct way of achieving this, please let me know.
   
  Check out the search facility at www.ajaxtrend.com, where I have merged a 
couple of Nutch indexes.


       