On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another. KS has
solved this problem for stored fields. Field defs are global and
field values are keyed by name rather than field number in the field
data file. Benefits:
* Whole documents can be read from one segment to
another as blobs.
* No flags byte.
* No remapping of field numbers.
* No conflict resolution at all.
* Compressed, uncompressed... doesn't matter.
* Less code.
* The possibility of allowing the user to provide their
own subclass for reading and writing fields. (For
Lucy, in the language of your choice.)
I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.
Yeah, too bad. This is one area where Lucene and Lucy are going to
differ. Balmain and I are of one mind about global field defs.
I think the ability to suddenly birth a new field,
You can do that in KS as of version 0.20_02. :)
or change a field's attributes like "has vectors", "stores norms",
etc., with a new document,
Can't do that, though, and I make no apologies. I think it's a
misfeature.
I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits? Hmmm.
You'll still have to be able to remap field numbers when adding
entire indexes.
Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X. I think for prox postings
you can.
You can probably squeeze out some nice gains using a skipVint()
function, even with deletions.
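A skipVint() along those lines is cheap because Lucene-style VInts mark continuation in the high bit of each byte, so skipping never requires decoding the value. A sketch (illustrative Python, not actual Lucene Java):

```python
def write_vint(out, value):
    # Lucene-style VInt: 7 data bits per byte, low-order bits first;
    # a set high bit means another byte follows.
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)

def skip_vint(buf, pos):
    # Advance past one VInt without decoding it: scan to the first
    # byte whose continuation bit is clear.
    while buf[pos] & 0x80:
        pos += 1
    return pos + 1
```

Skipping is just a byte scan, so stepping over postings for deleted docs avoids the shift-and-or work of a full readVInt.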
But for freq postings, you can't, because they are delta coded.
I'm working on this task right now for KS.
KS implements the "Flexible Indexing" paradigm, so all posting data
goes in a single file.
I've applied an additional constraint to KS: Every binary file must
consist of one type of record repeated over and over. Every indexed
field gets its own dedicated posting file with the suffix .pNNN to
allow per-field posting formats.
The I/O code is isolated in subclasses of a new class called
"Stepper": You can turn any Stepper loose on its file and read it
from top to tail. When the file format changes, Steppers will get
archived, like old plugins.
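The Stepper idea might be sketched like this (a hypothetical Python rendering of the concept, not KS's actual class): each subclass knows one record format, and any Stepper can consume its file from top to tail.

```python
class Stepper:
    """Sketch of the 'Stepper' concept: a reader bound to one record
    format that can walk its file from top to tail."""

    def read_record(self, stream):
        """Read and return one record, or None at end of file."""
        raise NotImplementedError

    def step_through(self, stream):
        """Yield every record in the stream, in file order."""
        while True:
            rec = self.read_record(stream)
            if rec is None:
                return
            yield rec

class DocIdStepper(Stepper):
    # Illustrative record type: one 4-byte little-endian doc ID
    # repeated over and over, per the one-record-type-per-file rule.
    def read_record(self, stream):
        data = stream.read(4)
        if len(data) < 4:
            return None
        return int.from_bytes(data, "little")
```

When the file format changes, only a frozen Stepper subclass need be kept around to read old segments.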
My present task is to write the code for the Stepper subclasses
MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can
wait.) As I write them, I will see if I can figure out a format that
can be merged as speedily as possible. Perhaps the precise variant
of delta encoding used in Lucene's .frq file should be avoided.
Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted? So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting? I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?
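That scheme can be sketched as follows (Python for illustration; the deltas here are plain docID deltas, ignoring the freq data that Lucene's .frq actually interleaves with them): only the first delta of each incoming run is re-encoded against the last docID already written, and the remainder of the run is appended unchanged, which in the real file is what would let copyBytes handle the tail.

```python
def merge_delta_postings(segments):
    """Merge delta-coded docID runs for one term across segments.

    segments: list of (doc_base, deltas), where doc_base is the
    segment's docID offset in the merged index and deltas are the
    term's delta-coded docIDs within that segment (first delta ==
    first docID, since deltas start from zero per segment).
    """
    merged = []
    last_doc = 0  # last absolute docID written to the merged run
    for doc_base, deltas in segments:
        if not deltas:
            continue
        first_doc = doc_base + deltas[0]
        merged.append(first_doc - last_doc)   # the one re-encoded entry
        merged.extend(deltas[1:])             # raw copy in the real file
        last_doc = doc_base + sum(deltas)     # last absolute docID in run
    return merged
```

With no deletions, everything after the first entry of each segment's run is byte-for-byte identical to the input, so the merge cost drops to one re-encode per term per segment plus a bulk copy.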
Ugh, more special case code.
I have to say, I started trying to go over your patch, and the
overwhelming impression I got coming back to this part of the Lucene
code base in earnest for the first time since using 1.4.3 as a
porting reference was: simplicity seems to be nobody's priority these
days.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/