On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another. KS has
solved this problem for stored fields. Field defs are global and
field values are keyed by name rather than field number in the field
data file. Benefits:
* Whole documents can be read from one segment to
another as blobs.
* No flags byte.
* No remapping of field numbers.
* No conflict resolution at all.
* Compressed, uncompressed... doesn't matter.
* Less code.
* The possibility of allowing the user to provide their
own subclass for reading and writing fields. (For
Lucy, in the language of your choice.)
I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.
Yeah, too bad. This is one area where Lucene and Lucy are going to
differ. Balmain and I are of one mind about global field defs.
I think the ability to suddenly birth a new field,
You can do that in KS as of version 0.20_02. :)
or change a field's attributes like "has vectors", "stores norms",
etc., with a new document,
Can't do that, though, and I make no apologies. I think it's a
misfeature.
I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits? Hmmm.
You'll still have to be able to remap field numbers when adding
entire indexes.
Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X. I think for prox postings
you can.
You can probably squeeze out some nice gains using a skipVint()
function, even with deletions.
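A skipVint() along those lines is cheap because Lucene-style VInts mark continuation in the high bit of each byte, so skipping never requires decoding the value. A sketch (illustrative Python, not actual Lucene Java):

```python
def write_vint(out, value):
    # Lucene-style VInt: 7 data bits per byte, low-order bits first;
    # a set high bit means another byte follows.
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)

def skip_vint(buf, pos):
    # Advance past one VInt without decoding it: scan to the first
    # byte whose continuation bit is clear.
    while buf[pos] & 0x80:
        pos += 1
    return pos + 1
```

Skipping is just a byte scan, so stepping over postings for deleted docs avoids the shift-and-or work of a full readVInt.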
But for freq postings, you can't, because they are delta coded.
I'm working on this task right now for KS.
KS implements the "Flexible Indexing" paradigm, so all posting data
goes in a single file.
I've applied an additional constraint to KS: Every binary file must
consist of one type of record repeated over and over. Every indexed
field gets its own dedicated posting file with the suffix .pNNN to
allow per-field posting formats.
The I/O code is isolated in subclasses of a new class called
"Stepper": You can turn any Stepper loose on its file and read it
from top to tail. When the file format changes, Steppers will get
archived, like old plugins.
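The Stepper idea might be sketched like this (a hypothetical Python rendering of the concept, not KS's actual class): each subclass knows one record format, and any Stepper can consume its file from top to tail.

```python
class Stepper:
    """Sketch of the 'Stepper' concept: a reader bound to one record
    format that can walk its file from top to tail."""

    def read_record(self, stream):
        """Read and return one record, or None at end of file."""
        raise NotImplementedError

    def step_through(self, stream):
        """Yield every record in the stream, in file order."""
        while True:
            rec = self.read_record(stream)
            if rec is None:
                return
            yield rec

class DocIdStepper(Stepper):
    # Illustrative record type: one 4-byte little-endian doc ID
    # repeated over and over, per the one-record-type-per-file rule.
    def read_record(self, stream):
        data = stream.read(4)
        if len(data) < 4:
            return None
        return int.from_bytes(data, "little")
```

When the file format changes, only a frozen Stepper subclass need be kept around to read old segments.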
My present task is to write the code for the Stepper subclasses
MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can
wait.) As I write them, I will see if I can figure out a format that
can be merged as speedily as possible. Perhaps the precise variant
of delta encoding used in Lucene's .frq file should be avoided.
Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted? So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting? I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?
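That scheme can be sketched as follows (Python for illustration; the deltas here are plain docID deltas, ignoring the freq data that Lucene's .frq actually interleaves with them): only the first delta of each incoming run is re-encoded against the last docID already written, and the remainder of the run is appended unchanged, which in the real file is what would let copyBytes handle the tail.

```python
def merge_delta_postings(segments):
    """Merge delta-coded docID runs for one term across segments.

    segments: list of (doc_base, deltas), where doc_base is the
    segment's docID offset in the merged index and deltas are the
    term's delta-coded docIDs within that segment (first delta ==
    first docID, since deltas start from zero per segment).
    """
    merged = []
    last_doc = 0  # last absolute docID written to the merged run
    for doc_base, deltas in segments:
        if not deltas:
            continue
        first_doc = doc_base + deltas[0]
        merged.append(first_doc - last_doc)   # the one re-encoded entry
        merged.extend(deltas[1:])             # raw copy in the real file
        last_doc = doc_base + sum(deltas)     # last absolute docID in run
    return merged
```

With no deletions, everything after the first entry of each segment's run is byte-for-byte identical to the input, so the merge cost drops to one re-encode per term per segment plus a bulk copy.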
Ugh, more special case code.
I have to say, I started trying to go over your patch, and the
overwhelming impression I got coming back to this part of the Lucene
code base in earnest for the first time since using 1.4.3 as a
porting reference was: simplicity seems to be nobody's priority these
days.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/