Hi

I have a number of Directories stored at various paths on HDFS, and I would
like to merge them into a single index.  The obvious way to do this is
IndexWriter.addIndexes(...), but I'm hoping I can do better.  Since I created
each of the separate indexes using Map/Reduce, I know there are no deleted or
duplicate documents and the codecs are the same.  addIndexes(...) incurs a lot
of I/O as it copies from each source Directory into the destination Directory,
and that copy is the part I would like to avoid.  Would it instead be possible
to simply move the segments from each path into a single path on HDFS using a
mv/rename operation?  Obviously I would need to take care of the naming so
that the files from one index don't overwrite another's, but it looks like
segment names are generated from a counter of some sort so that the latest
segment can be found.  A potential complication is the segments_1 file, as I'm
not sure exactly what it contains or whether I can easily (re)construct it
externally.
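For reference, here is the baseline I'm hoping to beat -- just a minimal
sketch of addIndexes(...) as I understand it (assuming a Lucene 5+ style API;
FSDirectory, StandardAnalyzer and the literal paths are stand-ins, and on HDFS
you would substitute an HDFS-backed Directory implementation):

  import java.nio.file.Paths;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class CopyMerge {
      public static void main(String[] args) throws Exception {
          // Destination index; on HDFS this would be an HDFS-backed Directory.
          Directory dest = FSDirectory.open(Paths.get("/user/username/merged"));
          IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
          try (IndexWriter writer = new IndexWriter(dest, iwc)) {
              Directory src0 = FSDirectory.open(Paths.get("/user/username/MR_output/0"));
              Directory src1 = FSDirectory.open(Paths.get("/user/username/MR_output/1"));
              // addIndexes copies every file from the source Directories
              // into dest -- this copy is the I/O I would like to avoid.
              writer.addIndexes(src0, src1);
              writer.commit();
          }
      }
  }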

The end goal here is to index using Map/Reduce and then produce a single index
that has been merged down to one segment, while minimizing I/O along the way.
Once I have the completed index in a single Directory, I can (optionally)
perform the forced merge (which will incur a huge I/O hit).  If the forced
merge isn't performed on HDFS, it could be done on the search nodes before the
active searcher is switched.  That may be preferable if, for example, you know
all of your search nodes have SSDs and I/O to spare.
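For completeness, the forced merge itself would be something like this (again
only a sketch, assuming the combined index already lives in
/user/username/merged and the same stand-ins as above):

  import java.nio.file.Paths;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class ForceMergeDown {
      public static void main(String[] args) throws Exception {
          Directory mergedDir = FSDirectory.open(Paths.get("/user/username/merged"));
          IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
          try (IndexWriter writer = new IndexWriter(mergedDir, iwc)) {
              // Rewrites all segments into one -- the big I/O hit above.
              writer.forceMerge(1);
              writer.commit();
          }
      }
  }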

Just in case my explanation above wasn't clear enough, here is a picture:

What I have:

/user/username/MR_output/0
  _0.fdt
  _0.fdx
  _0.fnm
  _0.si
  ...
  segments_1

/user/username/MR_output/1
  _0.fdt
  _0.fdx
  _0.fnm
  _0.si
  ...
  segments_1


What I want (using simple mv/rename):

/user/username/merged
  _0.fdt
  _0.fdx
  _0.fnm
  _0.si
  ...
  _1.fdt
  _1.fdx
  _1.fnm
  _1.si
  ...
  segments_1
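
Concretely, the rename step I have in mind would look something like the
sketch below.  This is hypothetical, not working code: it uses Hadoop's
FileSystem API, assumes each source directory holds exactly one segment named
_0 (as in the picture, with no deletions), assumes Lucene's base-36 segment
naming (_0, _1, ... _a, ...), and deliberately skips the per-index segments_1
files, since producing a combined segments_N is exactly the part I'm unsure
about:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class RenameMerge {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          Path merged = new Path("/user/username/merged");
          fs.mkdirs(merged);

          String[] sources = {"/user/username/MR_output/0",
                              "/user/username/MR_output/1"};
          int counter = 0; // next free segment number in the merged index

          for (String src : sources) {
              // Rename this source's single segment _0 to the next free
              // name, e.g. _0.fdt stays _0.fdt, the second becomes _1.fdt.
              String newName = "_" + Long.toString(counter++, Character.MAX_RADIX);
              for (FileStatus st : fs.listStatus(new Path(src))) {
                  String name = st.getPath().getName();
                  if (name.startsWith("segments")) {
                      // The per-index segments_1 files are NOT moved; a
                      // combined segments_N would have to be (re)built somehow.
                      continue;
                  }
                  String renamed = newName + name.substring(name.indexOf('.'));
                  fs.rename(st.getPath(), new Path(merged, renamed));
              }
          }
      }
  }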




Thanks,

Shaun
