I have generated a number of indexes for individual sites using Nutch and was
looking for a way to merge them all into one index to be used in a live
system. I struggled with this for a while and finally found a way.
Here are the steps.
Let's say you have two working crawls, crawl1 and crawl2. I am assuming
that each of them has the following directories, generated by the
bin/nutch crawl command:
crawldb
index
indexes
linkdb
segments
Now assume you are in the parent directory that contains the folders crawl1
and crawl2.
You need to merge the individual dbs, i.e. linkdb, crawldb and segments. Then
you need to generate the index.
Please create a directory called mergeaall. This directory will contain the
merged linkdb, crawldb and segments.
- Merge linkdbs
bin/nutch mergelinkdb mergeaall/linkdb crawl1/linkdb crawl2/linkdb
- Merge crawldbs
bin/nutch mergedb mergeaall/crawldb crawl1/crawldb crawl2/crawldb
- Merge segments
bin/nutch mergesegs mergeaall/segments crawl1/segments/* crawl2/segments/*
- Invertlinks
bin/nutch invertlinks mergeaall/linkdb/ mergeaall/segments/*
Now run the index command to create the Nutch index
bin/nutch index mergeaall/indexes mergeaall/crawldb/ mergeaall/linkdb/ mergeaall/segments/*
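To avoid retyping the steps, they can be wrapped in a small shell function. This is only a rough sketch: merge_crawls and the NUTCH variable are names I made up for illustration, not part of Nutch. It assumes the crawl paths contain no spaces, and that your Nutch version takes the index arguments in the order index dir, crawldb, linkdb, segments. Run bin/nutch index with no arguments to check the usage on your installation.

```shell
#!/bin/sh
# Sketch only: merge several Nutch crawl directories and build one index.
# NUTCH and merge_crawls are illustrative names, not provided by Nutch.
NUTCH="${NUTCH:-bin/nutch}"

# usage: merge_crawls <output_dir> <crawl_dir> [<crawl_dir> ...]
merge_crawls() {
    out="$1"; shift
    linkdbs=""; crawldbs=""; segments=""
    for c in "$@"; do
        linkdbs="$linkdbs $c/linkdb"
        crawldbs="$crawldbs $c/crawldb"
        # the glob is expanded later, when $segments is used unquoted
        segments="$segments $c/segments/*"
    done
    mkdir -p "$out" || return 1
    # the lists are left unquoted on purpose so they split into arguments
    $NUTCH mergelinkdb "$out/linkdb" $linkdbs || return 1
    $NUTCH mergedb "$out/crawldb" $crawldbs || return 1
    $NUTCH mergesegs "$out/segments" $segments || return 1
    $NUTCH invertlinks "$out/linkdb" "$out"/segments/* || return 1
    # assumed argument order: <index> <crawldb> <linkdb> <segment> ...
    $NUTCH index "$out/indexes" "$out/crawldb" "$out/linkdb" "$out"/segments/* || return 1
}
```

From the parent directory it would be called as: merge_crawls mergeaall crawl1 crawl2. Setting NUTCH=echo first prints the commands instead of running them, which is a cheap way to check the paths.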
If there is a more direct way of achieving this, please let me know.
Check out the search facility on www.ajaxtrend.com, where I have merged a
couple of Nutch indexes.