On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:

>> What we need to do is cut down on decompression and conflict
>> resolution costs when reading from one segment to another.  KS has
>> solved this problem for stored fields.  Field defs are global and
>> field values are keyed by name rather than field number in the field
>> data file.  Benefits:
>>
>>    * Whole documents can be read from one segment to
>>      another as blobs.
>>    * No flags byte.
>>    * No remapping of field numbers.
>>    * No conflict resolution at all.
>>    * Compressed, uncompressed... doesn't matter.
>>    * Less code.
>>    * The possibility of allowing the user to provide their
>>      own subclass for reading and writing fields. (For
>>      Lucy, in the language of your choice.)

> I hear you, and I really really love those benefits, but we just
> don't have this freedom with Lucene.

Yeah, too bad. This is one area where Lucene and Lucy are going to differ. Balmain and I are of one mind about global field defs.
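
To make the first of those benefits concrete, here's a back-of-the-envelope sketch in Java (hypothetical method, not actual KS code, which is C and Perl; assumes Lucene-style IndexInput/IndexOutput): when field values are keyed by name against global field defs, the stored-fields merge loop never has to parse the document data at all.

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    public class BlobDocMerger {
      // Sketch only: with global field defs, merging stored fields
      // degenerates into a length-prefixed blob copy: no flags byte,
      // no field-number remap, no conflict resolution, and it works
      // the same whether the values are compressed or not.
      public static void mergeStoredDocs(IndexInput in, IndexOutput out,
                                         int numDocs) throws IOException {
        for (int i = 0; i < numDocs; i++) {
          int docLen = in.readVInt();  // byte length of the serialized doc
          out.writeVInt(docLen);
          out.copyBytes(in, docLen);   // raw copy; contents never decoded
        }
      }
    }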

> I think the ability to suddenly birth a new field,

You can do that in KS as of version 0.20_02.  :)

> or change a field's attributes like "has vectors", "stores norms",
> etc., with a new document,

Can't do that, though, and I make no apologies. I think it's a misfeature.

> I suppose if we had a
> single mapping of field names -> numbers in the index, that would gain
> us many of the above benefits?  Hmmm.

You'll still have to be able to remap field numbers when adding entire indexes.
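
Something along these lines (hypothetical sketch, names made up) is hard to avoid once two indexes have numbered their fields independently:

    // Reconciling field numbering when adding one index to another.
    // destFields maps name -> number in the destination index;
    // srcFieldNames is the incoming index's number -> name table.
    // (Assumes java.util.Map.)
    static int[] buildFieldRemap(Map<String, Integer> destFields,
                                 String[] srcFieldNames) {
      int[] remap = new int[srcFieldNames.length];
      for (int srcNum = 0; srcNum < srcFieldNames.length; srcNum++) {
        Integer destNum = destFields.get(srcFieldNames[srcNum]);
        if (destNum == null) {
          destNum = Integer.valueOf(destFields.size()); // fresh number
          destFields.put(srcFieldNames[srcNum], destNum);
        }
        remap[srcNum] = destNum.intValue(); // postings must be rewritten
      }
      return remap;
    }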

> Here's one idea I just had: assuming there are no deletions, you can
> almost do a raw bytes copy from input segment to output (merged)
> segment of the postings for a given term X.  I think for prox postings
> you can.

You can probably squeeze out some nice gains using a skipVint() function, even with deletions.
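
A skipVInt() would be trivial, since the encoding is self-delimiting. A sketch, assuming Lucene-style VInts (high bit set on every byte except the last):

    // Skip over one VInt without decoding its value.
    static void skipVInt(IndexInput in) throws IOException {
      while ((in.readByte() & 0x80) != 0) {
        // high bit set: another byte follows
      }
    }

With deletions you would still decode the doc deltas (to consult the deleted-docs bit vector) and the freq (to know how many position deltas follow), but the positions themselves could be skipped this way instead of decoded.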

> But for freq postings, you can't, because they are delta coded.

I'm working on this task right now for KS.

KS implements the "Flexible Indexing" paradigm, so all posting data goes in a single file.

I've applied an additional constraint to KS: Every binary file must consist of one type of record repeated over and over. Every indexed field gets its own dedicated posting file with the suffix .pNNN to allow per-field posting formats.

The I/O code is isolated in subclasses of a new class called "Stepper": You can turn any Stepper loose on its file and read it from top to tail. When the file format changes, Steppers will get archived, like old plugins.
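
In Java terms, a Stepper boils down to something like this (hypothetical rendering; KS itself is C and Perl):

    // Each binary file holds exactly one record type, repeated to end
    // of file; a Stepper decodes its file sequentially, top to tail.
    abstract class Stepper {
      protected final IndexInput in;
      Stepper(IndexInput in) { this.in = in; }
      /** Decode the next record; return false at end of file. */
      abstract boolean step() throws IOException;
    }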

My present task is to write the code for the Stepper subclasses MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can wait.) As I write them, I will see if I can figure out a format that can be merged as speedily as possible. Perhaps the precise variant of delta encoding used in Lucene's .frq file should be avoided.
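
For illustration only, since I haven't settled the real formats yet: the simplest subclass might step like this, with a MatchPosting record consisting of nothing but a doc-ID delta.

    // Made-up record layout: one VInt doc-ID delta per posting.
    class MatchPostingStepper extends Stepper {
      int docID = 0;                  // running absolute doc ID
      MatchPostingStepper(IndexInput in) { super(in); }
      boolean step() throws IOException {
        if (in.getFilePointer() >= in.length()) return false;
        docID += in.readVInt();       // delta-decode against previous doc
        return true;
      }
    }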

> Except: it's only the first entry of the incoming segment's freq
> postings that needs to be re-interpreted?  So you could read that one,
> encode the delta based on "last docID" for the previous segment (I think
> we'd have to store this in the index, probably only if termFreq >
> threshold), and then copyBytes the rest of the posting?  I will try
> this out on the merges I'm doing in LUCENE-843; I think it should
> work and make merging faster (assuming no deletes)?

Ugh, more special case code.
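
To be concrete about what you're proposing, here's a rough sketch (hypothetical helper; assumes plain doc-ID deltas and no deletions, and note that Lucene's actual .frq folds a freq bit into the low bit of the delta, which would make it messier still):

    // Re-encode only the first entry, then raw-copy the rest.
    // lastDocID: last doc ID already written for this term in the merge.
    // docBase:   doc-ID offset of the incoming segment.
    // numBytes:  byte length of this term's postings in that segment.
    static void appendPostings(IndexInput in, IndexOutput out,
                               long numBytes, int lastDocID, int docBase)
        throws IOException {
      long start = in.getFilePointer();
      int firstDocID = docBase + in.readVInt(); // decode first entry only
      out.writeVInt(firstDocID - lastDocID);    // re-base it for the merge
      out.copyBytes(in, numBytes - (in.getFilePointer() - start));
    }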

I have to say, I started trying to go over your patch, and the overwhelming impression I got, coming back to this part of the Lucene code base in earnest for the first time since using 1.4.3 as a porting reference, was that simplicity seems to be nobody's priority these days.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


