In addition, use NoMergePolicy to prevent automatic merging once the segments have been added. :-)
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Tuesday, December 30, 2014 2:20 PM
> To: 'java-user@lucene.apache.org'
> Subject: RE: manually merging Directories
>
> Hi Shaun,
>
> you can actually do this relatively simply. In fact, most of the files are
> indeed copied as-is, so you can theoretically change the logic to make a
> simple rename. Files that cannot be copied unmodified and need to be changed
> by IndexWriter will be handled as usual.
>
> You don't need to patch Lucene for this: IndexWriter calls
> Directory#copy(Directory to, String src, String dest, IOContext context) for
> those files that can be copied unmodified. What you need to do is: just
> create an oal.store.FilterDirectory that wraps the original FSDirectory and
> implement this copy method on it to just do a rename, like:
>
> import java.io.IOException;
> import java.nio.file.Files;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.FilterDirectory;
> import org.apache.lucene.store.IOContext;
>
> public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
>   public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
>     super(dir);
>   }
>
>   public void copy(Directory to, String src, String dest, IOContext context)
>       throws IOException {
>     if (!(to instanceof FSDirectory)) {
>       throw new IOException("This only works for target FSDirectories");
>     }
>     final FSDirectory fromFS = (FSDirectory) this.getDelegate(),
>         toFS = (FSDirectory) to;
>     // Move the file into the target directory instead of copying it.
>     Files.move(fromFS.getDirectory().resolve(src),
>         toFS.getDirectory().resolve(dest));
>   }
> }
>
> Please be aware that you have to wrap the "source" directory, because
> IndexWriter's copySegmentAsIs() calls this method on the directory that's
> passed to addIndexes(Directory). Something like:
>
> writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));
>
> After that, all files that were not copied unmodified remain in the source
> directory, but all those that are copied as-is will move and disappear from
> the source directory.
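
The effect of moving rather than copying can be tried out with plain
java.nio.file calls, without Lucene on the classpath. The sketch below is only
an illustration of that file-system behaviour: the class name
RenameSemanticsDemo, the moveFile helper, and the file names are hypothetical,
and it does not involve IndexWriter or FilterDirectory at all.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class: shows what Files.move does to the source and
// target directories. It is not part of Lucene.
public class RenameSemanticsDemo {

    // Moves sourceDir/src to targetDir/dest. Without REPLACE_EXISTING,
    // Files.move fails if the target file already exists.
    public static void moveFile(Path sourceDir, Path targetDir, String src, String dest)
            throws IOException {
        Files.move(sourceDir.resolve(src), targetDir.resolve(dest));
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempDirectory("source");
        Path target = Files.createTempDirectory("target");

        // Stand-in for a segment file that would be copied as-is.
        Files.write(source.resolve("_0.fdt"), new byte[] {1, 2, 3});

        // Move it under a new name, as a rename-based copy would.
        moveFile(source, target, "_0.fdt", "_1.fdt");

        System.out.println("still in source: " + Files.exists(source.resolve("_0.fdt")));
        System.out.println("present in target: " + Files.exists(target.resolve("_1.fdt")));
    }
}
```

After the move the file exists only under its new name in the target
directory, which is why the wrapped source index should be treated as consumed
once addIndexes has run.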
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
> > Sent: Tuesday, December 30, 2014 12:37 AM
> > To: Lucene Users
> > Subject: Re: manually merging Directories
> >
> > Hi Mike
> >
> > That's actually what I was looking at doing, I was just hoping there
> > was a way to avoid the "copySegmentAsIs" step and simply replace it with
> > a "rename" operation on the file system. It seemed like low hanging
> > fruit, but Uwe and Erick have now told me that the segments have
> > dependencies embedded in them somehow, so a simple rename operation
> > wouldn't accomplish the same thing. In the end, it may not be a big deal
> > anyway.
> >
> >
> > Thanks
> >
> > Shaun
> >
> >
> > ________________________________________
> > From: Michael McCandless <luc...@mikemccandless.com>
> > Sent: December 29, 2014 2:43 PM
> > To: Lucene Users
> > Subject: Re: manually merging Directories
> >
> > Why not use IW.addIndexes(Directory[])?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > Hi,
> > >
> > > Why not simply leave each index directory on the searcher nodes as is:
> > > Move all index directories (as mentioned by you) to a local disk and
> > > access them using a MultiReader - there is no need to merge them if
> > > you do not have enough resources. If you have enough CPU and IO power,
> > > just merge them as usual with IndexWriter.addIndexes(). But I don't
> > > understand your argument about I/O: If you copy the index files from
> > > HDFS to local disks already, how can this work without I/O? So you can
> > > merge them anyway.
> > >
> > > Merging index files simply by copying them all into one directory is
> > > impossible, because the files reference each other by segment name
> > > (segments_n refers to them, and the segment ids are used all over).
> > > So you would need to change some index files for the merge anyway, to
> > > make the SegmentInfos structures use the correct names, so you can
> > > just as well do a real merge.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: u...@thetaphi.de
> > >
> > >
> > >> -----Original Message-----
> > >> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
> > >> Sent: Monday, December 29, 2014 6:34 PM
> > >> To: java-user
> > >> Subject: Re: manually merging Directories
> > >>
> > >> I'm not worried about the I/O right now, I'm "hoping I can do
> > >> better", that's all. It sounds like the only actual complication
> > >> here is building the segments_N file, which would list all of the
> > >> newly renamed segments, so perhaps this isn't impossible. That said,
> > >> you're absolutely right about the possibility of complications, so
> > >> it's debatable if doing something like this would be worth it in the
> > >> end. Thanks for the info
> > >>
> > >>
> > >>
> > >> Shaun
> > >>
> > >>
> > >> ________________________________________
> > >> From: Erick Erickson <erickerick...@gmail.com>
> > >> Sent: December 23, 2014 5:55 PM
> > >> To: java-user
> > >> Subject: Re: manually merging Directories
> > >>
> > >> I doubt this is going to work. I have to ask why you're worried about
> > >> the I/O; this smacks of premature optimization. Not only do the files
> > >> have to be moved, but the right control structures need to be in
> > >> place to inform Solr (well, Lucene) exactly what files are current.
> > >> There's a lot of room for programming errors here....
> > >>
> > >> segments_n is the file that tells Lucene which segments are active.
> > >> There can only be one that's active so you'd have to somehow combine
> > >> them all.
> > >>
> > >> I think this is a dubious proposition at best, all to avoid some I/O.
> > >> How much I/O are we talking here? If it's a huge amount, I'm not at
> > >> all sure you'll be able to _use_ your merged index.
> > >> How many docs are we talking about? 100M? 10B? I mean you used M/R on
> > >> it in the first place for a reason....
> > >>
> > >> But this is what the --go-live option of the MapReduceIndexerTool
> > >> already does for you. Admittedly, it copies things around the network
> > >> to the final destination, personally I'd just use that.
> > >>
> > >> As you can tell, I don't know all the details to say it's impossible,
> > >> IMO this feels like wasted effort with lots of possibilities to
> > >> get things wrong for little demonstrated benefit. You'd spend a lot
> > >> more time trying to figure out the correct thing to do and then
> > >> fixing bugs than you'll spend waiting for the copy, HDFS or no.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
> > >> <shaun.sene...@lithium.com> wrote:
> > >> > Hi
> > >> >
> > >> > I have a number of Directories which are stored in various paths on
> > >> > HDFS, and I would like to merge them into a single index. The
> > >> > obvious way to do this is to use IndexWriter.addIndexes(...),
> > >> > however, I'm hoping I can do better. Since I have created each of
> > >> > the separate indexes using Map/Reduce, I know that there are no
> > >> > deleted or duplicate documents and the codecs are the same. Using
> > >> > addIndexes(...) will incur a lot of I/O as it copies from the
> > >> > source Directory into the dest Directory, and this is the bit I
> > >> > would like to avoid. Would it be possible to simply move each of
> > >> > the segments from each path into a single path on HDFS using a
> > >> > mv/rename operation instead?
> > >> > Obviously I would need to take care of the naming to ensure the
> > >> > files from one index don't overwrite another's, but it looks like
> > >> > this is done with a counter of some sort so that the latest segment
> > >> > can be found. A potential complication is the segments_1 file, as
> > >> > I'm not sure what that is for or if I can easily (re)construct it
> > >> > externally.
> > >> >
> > >> > The end goal here is to index using Map/Reduce and then spit out a
> > >> > single index in the end that has been merged down to a single
> > >> > segment, and to minimize IO while doing it. Once I have the
> > >> > completed index in a single Directory, I can (optionally) perform
> > >> > the forced merge (which will incur a huge IO hit). If the forced
> > >> > merge isn't performed on HDFS, it could be done on the search nodes
> > >> > before the active searcher is switched. This may be better if, for
> > >> > example, you know all of your search nodes have SSDs and IO to
> > >> > spare.
> > >> >
> > >> > Just in case my explanation above wasn't clear enough, here is a
> > >> > picture
> > >> >
> > >> > What I have:
> > >> >
> > >> > /user/username/MR_output/0
> > >> >   _0.fdt
> > >> >   _0.fdx
> > >> >   _0.fnm
> > >> >   _0.si
> > >> >   ...
> > >> >   segments_1
> > >> >
> > >> > /user/username/MR_output/1
> > >> >   _0.fdt
> > >> >   _0.fdx
> > >> >   _0.fnm
> > >> >   _0.si
> > >> >   ...
> > >> >   segments_1
> > >> >
> > >> >
> > >> > What I want (using simple mv/rename):
> > >> >
> > >> > /user/username/merged
> > >> >   _0.fdt
> > >> >   _0.fdx
> > >> >   _0.fnm
> > >> >   _0.si
> > >> >   ...
> > >> >   _1.fdt
> > >> >   _1.fdx
> > >> >   _1.fnm
> > >> >   _1.si
> > >> >   ...
> > >> >   segments_1
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Shaun

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org