Hi,

I have a number of Directories stored at various paths on HDFS, and I would like to merge them into a single index. The obvious way to do this is IndexWriter.addIndexes(...), but I'm hoping I can do better. Since I created each of the separate indexes with Map/Reduce, I know there are no deleted or duplicate documents and the codecs are all the same. addIndexes(...) incurs a lot of I/O because it copies every file from the source Directories into the destination Directory, and that is the part I would like to avoid.

Would it instead be possible to simply move each of the segments from each path into a single path on HDFS using a mv/rename operation? Obviously I would need to take care of the naming so that the files from one index don't overwrite another's, but it looks like this is done with a counter of some sort so that the latest segment can be found. A potential complication is the segments_1 file, as I'm not sure exactly what it is for or whether I can easily (re)construct it externally.
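For reference, the "obvious" addIndexes(...) route I'm trying to improve on looks roughly like this. It is only a sketch: openHdfsDirectory is a stand-in for however an HDFS-backed Directory gets opened (e.g. an HdfsDirectory-style implementation), and the analyzer/config are placeholders using Lucene 5.x-style constructors.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

public class CopyMergeSketch {

    public static void main(String[] args) throws Exception {
        Directory dest = openHdfsDirectory("/user/username/merged");
        Directory[] sources = {
            openHdfsDirectory("/user/username/MR_output/0"),
            openHdfsDirectory("/user/username/MR_output/1")
        };

        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dest, cfg)) {
            // addIndexes copies every file from each source Directory into
            // dest -- this copy is the extra I/O I would like to avoid.
            writer.addIndexes(sources);
            writer.commit();
        }
    }

    // Hypothetical helper: open an HDFS-backed Lucene Directory for the
    // given path; the actual implementation is not shown here.
    private static Directory openHdfsDirectory(String path) {
        throw new UnsupportedOperationException("open an HDFS-backed Directory for " + path);
    }
}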
The end goal here is to index using Map/Reduce and spit out a single index at the end that has been merged down to a single segment, while minimizing I/O along the way. Once I have the completed index in a single Directory, I can (optionally) perform the forced merge, which will incur a huge I/O hit. If the forced merge isn't performed on HDFS, it could be done on the search nodes before the active searcher is switched; that may be better if, for example, you know all of your search nodes have SSDs and I/O to spare.

Just in case my explanation above wasn't clear enough, here is a picture.

What I have:

/user/username/MR_output/0
    _0.fdt
    _0.fdx
    _0.fnm
    _0.si
    ...
    segments_1
/user/username/MR_output/1
    _0.fdt
    _0.fdx
    _0.fnm
    _0.si
    ...
    segments_1

What I want (using simple mv/rename):

/user/username/merged
    _0.fdt
    _0.fdx
    _0.fnm
    _0.si
    ...
    _1.fdt
    _1.fdx
    _1.fnm
    _1.si
    ...
    segments_1

Thanks,
Shaun
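P.S. In case it helps, here is a rough sketch of the rename-based "merge" I have in mind, using the Hadoop FileSystem API. The paths and the _0 -> _1 renumbering are only illustrative (it assumes each source index holds exactly one segment named _0, as in the picture above), and it deliberately skips the part I'm unsure about, namely producing a valid segments_N file for the combined directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameMergeSketch {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path merged = new Path("/user/username/merged");
        fs.mkdirs(merged);

        // Move the segment files from each per-reducer index into the merged
        // directory, renumbering the segment prefix so they don't collide
        // (the _0 segment from MR_output/1 becomes _1, and so on).
        String[] sources = { "/user/username/MR_output/0", "/user/username/MR_output/1" };
        for (int i = 0; i < sources.length; i++) {
            for (FileStatus f : fs.listStatus(new Path(sources[i]))) {
                String name = f.getPath().getName();
                if (name.startsWith("segments")) {
                    // The open question: how to build a single segments_N
                    // that covers all of the moved segments.
                    continue;
                }
                String renamed = "_" + i + name.substring(name.indexOf('.'));
                fs.rename(f.getPath(), new Path(merged, renamed));
            }
        }
    }
}

The optional merge down to one segment afterwards would then just be opening an IndexWriter on the merged Directory and calling forceMerge(1).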