RE: manually merging Directories

Uwe Schindler Tue, 30 Dec 2014 05:24:08 -0800

In addition, use NoMergePolicy to prevent automatic merging once the segments 
were added. :-)


-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: Uwe Schindler [mailto:[email protected]]
> Sent: Tuesday, December 30, 2014 2:20 PM
> To: '[email protected]'
> Subject: RE: manually merging Directories
> 
> Hi Shaun,
> 
> you can actually do this relatively simply. In fact, most of the files are
> indeed copied as-is, so you can theoretically change the logic to do a simple
> rename instead. Files that cannot be copied unmodified and need to be changed
> by IndexWriter will be handled as usual.
> 
> You don't need to patch Lucene for this: IndexWriter calls
> Directory#copy(Directory to, String src, String dest, IOContext context) for
> those files that can be copied unmodified. What you need to do is create an
> oal.store.FilterDirectory that wraps the original FSDirectory and override
> this copy method on it to just do a rename, like:
> 
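> import java.io.IOException;
> import java.nio.file.Files;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.FilterDirectory;
> import org.apache.lucene.store.IOContext;
>
> public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
>   public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
>     super(dir);
>   }
>
>   // Called on the source directory for files that can be copied unmodified.
>   @Override
>   public void copy(Directory to, String src, String dest, IOContext context)
>       throws IOException {
>     if (!(to instanceof FSDirectory)) {
>       throw new IOException("This only works for target FSDirectories");
>     }
>     final FSDirectory fromFS = (FSDirectory) this.getDelegate();
>     final FSDirectory toFS = (FSDirectory) to;
>     // Move the file instead of copying its bytes.
>     Files.move(fromFS.getDirectory().resolve(src),
>         toFS.getDirectory().resolve(dest));
>   }
> }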
> 
> Please be aware that you have to wrap the "source" directory, because
> IndexWriter's copySegmentAsIs() calls this method on the directory that's
> passed to addIndexes(Directory). Something like:
> 
> writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));
> 
> After that, all files that could not be copied unmodified stay in the source
> directory, but all those that were copied as-is will have moved and
> disappeared from the source directory.
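
The move semantics described here can be tried without Lucene. The following is only a minimal sketch using plain java.nio.file; the class name RenameInsteadOfCopyDemo, the temp directory prefixes and the file names are made up for illustration and are not part of Lucene or of the original message. It shows that after Files.move the file is gone from the source directory and present, byte for byte, under its new name in the target directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical demo class; it only mimics what the copy() override does on disk.
class RenameInsteadOfCopyDemo {

  /** Moves srcDir/src to destDir/dest instead of copying the bytes. */
  static void moveFile(Path srcDir, Path destDir, String src, String dest)
      throws IOException {
    Files.move(srcDir.resolve(src), destDir.resolve(dest));
  }

  /** Returns true if the file left the source directory and arrived intact. */
  static boolean demo() {
    try {
      // Two throwaway directories standing in for the source and target index.
      Path srcDir = Files.createTempDirectory("source-index");
      Path destDir = Files.createTempDirectory("dest-index");
      byte[] payload = "stored fields".getBytes(StandardCharsets.UTF_8);
      Files.write(srcDir.resolve("_0.fdt"), payload);

      // The target name differs from the source name, as with addIndexes.
      moveFile(srcDir, destDir, "_0.fdt", "_1.fdt");

      boolean goneFromSource = Files.notExists(srcDir.resolve("_0.fdt"));
      boolean arrived =
          Arrays.equals(payload, Files.readAllBytes(destDir.resolve("_1.fdt")));
      return goneFromSource && arrived;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("moved as expected: " + demo());
  }
}
```

Two caveats apply to this sketch. Files.move is only a cheap rename when source and target are on the same file system; otherwise the JDK may fall back to copying and deleting. It also throws FileAlreadyExistsException if the target name is already taken, since no REPLACE_EXISTING option is passed.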
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
> 
> 
> > -----Original Message-----
> > From: Shaun Senecal [mailto:[email protected]]
> > Sent: Tuesday, December 30, 2014 12:37 AM
> > To: Lucene Users
> > Subject: Re: manually merging Directories
> >
> > Hi Mike
> >
> > That's actually what I was looking at doing, I was just hoping there
> > was a way to avoid the "copySegmentAsIs" step and simply replace it with a
> > "rename" operation on the file system.  It seemed like low hanging fruit, but
> > Uwe and Erick have now told me that the segments have dependencies
> > embedded in them somehow, so a simple rename operation wouldn't
> > accomplish the same thing.  In the end, it may not be a big deal anyway.
> >
> >
> > Thanks
> >
> > Shaun
> >
> >
> > ________________________________________
> > From: Michael McCandless <[email protected]>
> > Sent: December 29, 2014 2:43 PM
> > To: Lucene Users
> > Subject: Re: manually merging Directories
> >
> > Why not use IW.addIndexes(Directory[])?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <[email protected]>
> > wrote:
> > > Hi,
> > >
> > > Why not simply leave each index directory on the searcher nodes as is:
> > > Move all index directories (as mentioned by you) to a local disk and
> > > access them using a MultiReader - there is no need to merge them if you
> > > do not have enough resources. If you have enough CPU and IO power, just
> > > merge them as usual with IndexWriter.addIndexes(). But I don't understand
> > > your argument about I/O: If you copy the index files from HDFS to local
> > > disks already, how can this work without I/O? So you can merge them
> > > anyway.
> > >
> > > Merging index files simply by copying them all into one directory is
> > > impossible, because the files reference each other by segment name
> > > (segments_n refers to them, and the segment ids are used all over).
> > > So you would already need to change some index files for the merge, to
> > > make the SegmentInfos structures use the correct names, so you can do a
> > > real merge anyway.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > http://www.thetaphi.de
> > > eMail: [email protected]
> > >
> > >
> > >> -----Original Message-----
> > >> From: Shaun Senecal [mailto:[email protected]]
> > >> Sent: Monday, December 29, 2014 6:34 PM
> > >> To: java-user
> > >> Subject: Re: manually merging Directories
> > >>
> > >> I'm not worried about the I/O right now, I'm "hoping I can do
> > >> better", that's all.  It sounds like the only actual complication
> > >> here is building the segments_N file, which would list all of the
> > >> newly renamed segments, so perhaps this isn't impossible.  That said,
> > >> you're absolutely right about the possibility of complications, so
> > >> it's debatable whether doing something like this would be worth it in
> > >> the end.  Thanks for the info
> > >>
> > >>
> > >>
> > >> Shaun
> > >>
> > >>
> > >> ________________________________________
> > >> From: Erick Erickson <[email protected]>
> > >> Sent: December 23, 2014 5:55 PM
> > >> To: java-user
> > >> Subject: Re: manually merging Directories
> > >>
> > >> I doubt this is going to work. I have to ask why you're worried about
> > >> the I/O; this smacks of premature optimization. Not only do the files
> > >> have to be moved, but the right control structures need to be in
> > >> place to inform Solr (well, Lucene) exactly what files are current.
> > >> There's a lot of room for programming errors here....
> > >>
> > >> segments_n is the file that tells Lucene which segments are active.
> > >> There can only be one that's active so you'd have to somehow combine
> > >> them all.
> > >>
> > >> I think this is a dubious proposition at best, all to avoid some I/O.
> > >> How much I/O are we talking here? If it's a huge amount, I'm not at
> > >> all sure you'll be able to _use_ your merged index.
> > >> How many docs are we talking about? 100M? 10B? I mean you used M/R
> > >> on it in the first place for a reason....
> > >>
> > >> But this is what the --go-live option of the MapReduceIndexerTool
> > >> already does for you. Admittedly, it copies things around the network
> > >> to the final destination, but personally I'd just use that.
> > >>
> > >> As you can tell, I don't know all the details well enough to say it's
> > >> impossible, but IMO this feels like wasted effort with lots of
> > >> possibilities to get wrong for little demonstrated benefit. You'd spend
> > >> a lot more time trying to figure out the correct thing to do and then
> > >> fixing bugs than you'll spend waiting for the copy, HDFS or no.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
> > >> <[email protected]> wrote:
> > >> > Hi
> > >> >
> > >> > I have a number of Directories which are stored in various paths on
> > >> > HDFS, and I would like to merge them into a single index.  The obvious
> > >> > way to do this is to use IndexWriter.addIndexes(...), however, I'm
> > >> > hoping I can do better.  Since I have created each of the separate
> > >> > indexes using Map/Reduce, I know that there are no deleted or duplicate
> > >> > documents and the codecs are the same.  Using addIndexes(...) will
> > >> > incur a lot of I/O as it copies from the source Directory into the
> > >> > dest Directory, and this is the bit I would like to avoid.  Would it
> > >> > instead be possible to simply move each of the segments from each
> > >> > path into a single path on HDFS using a mv/rename operation?
> > >> > Obviously I would need to take care of the naming to ensure the files
> > >> > from one index don't overwrite another's, but it looks like this is
> > >> > done with a counter of some sort so that the latest segment can be
> > >> > found. A potential complication is the segments_1 file, as I'm not
> > >> > sure what that is for or if I can easily (re)construct it externally.
> > >> >
> > >> > The end goal here is to index using Map/Reduce and then spit out a
> > >> > single index in the end that has been merged down to a single segment,
> > >> > and to minimize IO while doing it.  Once I have the completed index in
> > >> > a single Directory, I can (optionally) perform the forced merge (which
> > >> > will incur a huge IO hit).  If the forced merge isn't performed on
> > >> > HDFS, it could be done on the search nodes before the active searcher
> > >> > is switched.  This may be better if, for example, you know all of
> > >> > your search nodes have SSDs and IO to spare.
> > >> >
> > >> > Just in case my explanation above wasn't clear enough, here is a
> > >> > picture:
> > >> >
> > >> > What I have:
> > >> >
> > >> > /user/username/MR_output/0
> > >> >   _0.fdt
> > >> >   _0.fdx
> > >> >   _0.fnm
> > >> >   _0.si
> > >> >   ...
> > >> >   segments_1
> > >> >
> > >> > /user/username/MR_output/1
> > >> >   _0.fdt
> > >> >   _0.fdx
> > >> >   _0.fnm
> > >> >   _0.si
> > >> >   ...
> > >> >   segments_1
> > >> >
> > >> >
> > >> > What I want (using simple mv/rename):
> > >> >
> > >> > /user/username/merged
> > >> >   _0.fdt
> > >> >   _0.fdx
> > >> >   _0.fnm
> > >> >   _0.si
> > >> >   ...
> > >> >   _1.fdt
> > >> >   _1.fdx
> > >> >   _1.fnm
> > >> >   _1.si
> > >> >   ...
> > >> >   segments_1
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Shaun
> > >> >
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: [email protected]
> > >> For additional commands, e-mail: [email protected]
> > >>
