Re: improve how IndexWriter uses RAM to buffer added documents

Michael McCandless Fri, 06 Apr 2007 01:51:40 -0700

"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
> 
> >> What we need to do is cut down on decompression and conflict
> >> resolution costs when reading from one segment to another.  KS has
> >> solved this problem for stored fields.  Field defs are global and
> >> field values are keyed by name rather than field number in the field
> >> data file.  Benefits:
> >>
> >>    * Whole documents can be read from one segment to
> >>      another as blobs.
> >>    * No flags byte.
> >>    * No remapping of field numbers.
> >>    * No conflict resolution at all.
> >>    * Compressed, uncompressed... doesn't matter.
> >>    * Less code.
> >>    * The possibility of allowing the user to provide their
> >>      own subclass for reading and writing fields. (For
> >>      Lucy, in the language of your choice.)
> >
> > I hear you, and I really really love those benefits, but, we just
> > don't have this freedom with Lucene.
> 
> Yeah, too bad.  This is one area where Lucene and Lucy are going to  
> differ.  Balmain and I are of one mind about global field defs.
> 
> > I think the ability to suddenly birth a new field,
> 
> You can do that in KS as of version 0.20_02.  :)


Excellent!

> > or change a field's attributes like "has vectors", "stores norms",
> > etc., with a new document,
> 
> Can't do that, though, and I make no apologies.  I think it's a  
> misfeature.

Alas, I don't think we (Lucene) can change this now.

> > I suppose if we had a
> > single mapping of field names -> numbers in the index, that would gain
> > us many of the above benefits?  Hmmm.
> 
> You'll still have to be able to remap field numbers when adding  
> entire indexes.

True, but it'd still be good progress for the frequent case of
adding/deleting docs to an existing index.  Progress not perfection...

> > Here's one idea I just had: assuming there are no deletions, you can
> > almost do a raw bytes copy from input segment to output (merged)
> > segment of the postings for a given term X.  I think for prox postings
> > you can.
> 
> You can probably squeeze out some nice gains using a skipVint()  
> function, even with deletions.

Good point.  I think a copyVInt(int numToCopy) would help in the same way.
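
A minimal sketch of what such helpers might look like, assuming the
usual VInt layout (7 data bits per byte, low-order bits first, high bit
set on every byte except the last).  The class and method names here are
hypothetical, not existing Lucene API, and the sketch works on plain
byte arrays rather than on IndexInput/IndexOutput.  The point is that
skipping or copying only needs to look at the continuation bit; no value
is ever decoded.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class for illustration; not part of Lucene.
final class VIntBytes {

    // Write one non-negative int as a VInt: 7 data bits per byte,
    // low-order bits first, high bit set on all bytes but the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Decode one VInt starting at pos[0] and advance pos[0] past it.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Skip `count` VInts starting at `offset` without decoding them.
    // A VInt ends at the first byte whose high bit is clear.
    static int skipVInts(byte[] buf, int offset, int count) {
        while (count > 0) {
            if ((buf[offset++] & 0x80) == 0) {
                count--;
            }
        }
        return offset;
    }

    // Copy `count` VInts verbatim to `out`; returns the offset just
    // past the last copied byte.
    static int copyVInts(byte[] buf, int offset, int count,
                         ByteArrayOutputStream out) {
        int end = skipVInts(buf, offset, count);
        out.write(buf, offset, end - offset);
        return end;
    }
}
```

With deletions, a merger could call skipVInts over the entries of a
deleted doc and copyVInts over runs of live ones, under the assumption
that it knows how many VInts each entry holds.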

> > But for freq postings, you can't, because they are delta coded.
> 
> I'm working on this task right now for KS.
> 
> KS implements the "Flexible Indexing" paradigm, so all posting data  
> goes in a single file.
> 
> I've applied an additional constraint to KS:  Every binary file must  
> consist of one type of record repeated over and over.  Every indexed  
> field gets its own dedicated posting file with the suffix .pNNN to  
> allow per-field posting formats.
> 
> The I/O code is isolated in subclasses of a new class called  
> "Stepper":  You can turn any Stepper loose on its file and read it  
> from top to tail.  When the file format changes, Steppers will get  
> archived, like old plugins.
> 
> My present task is to write the code for the Stepper subclasses  
> MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can  
> wait.)  As I write them, I will see if I can figure out a format that  
> can be merged as speedily as possible.  Perhaps the precise variant  
> of delta encoding used in Lucene's .frq file should be avoided.

Neat!  Yes, designing the file format to accommodate "merging"
efficiently (plus searching of course) is a good idea since we lose so
much indexing time to this.

> > Except: it's only the first entry of the incoming segment's freq
> > postings that needs to be re-interpreted?  So you could read that one,
> > encode the delta based on "last docID" for previous segment (I think
> > we'd have to store this in index, probably only if termFreq >
> > threshold), and then copyBytes the rest of the posting?  I will try
> > this out on the merges I'm doing in LUCENE-843; I think it should
> > work and make merging faster (assuming no deletes)?
> 
> Ugh, more special case code.
> 
> I have to say, I started trying to go over your patch, and the  
> overwhelming impression I got coming back to this part of the Lucene  
> code base in earnest for the first time since using 1.4.3 as a  
> porting reference was: simplicity seems to be nobody's priority these  
> days.

Unfortunately this is just a tough tradeoff... higher performance code
is often not "simple".  I also still need to clean up the code, add
comments, etc, but even after that, it's not going to look "simple".
I think this is just the reality of performance optimization.
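
To make the first-entry idea quoted above concrete, here is a rough
sketch under several assumptions: no deletions; each freq entry is a
VInt holding the doc delta shifted left one bit, with the low bit set
when freq is 1 and a separate freq VInt following otherwise (modeled on
the .frq convention); and the caller already knows docBase (number of
docs in the earlier segments) and lastDocID (last merged doc ID already
written for this term).  All names are hypothetical, and this is not
the code in the LUCENE-843 patch.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch for illustration; not Lucene API.
final class FreqPostingsMerge {

    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Encode one term's postings for a single segment.  The first
    // delta is taken relative to 0, i.e. it is the doc ID itself.
    static byte[] encode(int[] docIDs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int i = 0; i < docIDs.length; i++) {
            int delta = docIDs[i] - last;
            last = docIDs[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    // Append a later segment's postings to the merged output.  Only
    // the first doc delta is decoded and re-encoded; every remaining
    // byte (including the first entry's freq, if any) is copied as is.
    static void appendSegment(byte[] postings, int docBase, int lastDocID,
                              ByteArrayOutputStream out) {
        if (postings.length == 0) {
            return;
        }
        int[] pos = {0};
        int code = readVInt(postings, pos);
        int firstDocID = code >>> 1;                 // delta from 0
        int rebased = firstDocID + docBase - lastDocID;
        writeVInt(out, (rebased << 1) | (code & 1)); // keep the freq==1 flag
        out.write(postings, pos[0], postings.length - pos[0]);
    }

    // Decode into a flat array: docID, freq, docID, freq, ...
    static int[] decode(byte[] buf) {
        int[] flat = new int[2 * buf.length]; // each entry uses >= 1 byte
        int n = 0;
        int doc = 0;
        int[] pos = {0};
        while (pos[0] < buf.length) {
            int code = readVInt(buf, pos);
            doc += code >>> 1;
            int freq = (code & 1) != 0 ? 1 : readVInt(buf, pos);
            flat[n++] = doc;
            flat[n++] = freq;
        }
        return Arrays.copyOf(flat, n);
    }
}
```

The first segment's bytes can be written to the output unchanged, since
its doc IDs are not shifted; only later segments go through
appendSegment.  The cost is the extra bookkeeping for lastDocID per
term, which is the "special case code" being objected to above.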

Mike
