"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
>
> >> What we need to do is cut down on decompression and conflict
> >> resolution costs when reading from one segment to another.  KS has
> >> solved this problem for stored fields.  Field defs are global and
> >> field values are keyed by name rather than field number in the field
> >> data file.  Benefits:
> >>
> >>    * Whole documents can be read from one segment to
> >>      another as blobs.
> >>    * No flags byte.
> >>    * No remapping of field numbers.
> >>    * No conflict resolution at all.
> >>    * Compressed, uncompressed... doesn't matter.
> >>    * Less code.
> >>    * The possibility of allowing the user to provide their
> >>      own subclass for reading and writing fields. (For
> >>      Lucy, in the language of your choice.)
> >
> > I hear you, and I really really love those benefits, but, we just
> > don't have this freedom with Lucene.
>
> Yeah, too bad.  This is one area where Lucene and Lucy are going to
> differ.  Balmain and I are of one mind about global field defs.
>
> > I think the ability to suddenly birth a new field,
>
> You can do that in KS as of version 0.20_02.  :)

Excellent!

> > or change a field's attributes like "has vectors", "stores norms",
> > etc., with a new document,
>
> Can't do that, though, and I make no apologies.  I think it's a
> misfeature.

Alas, I don't think we (Lucene) can change this now.

> > I suppose if we had a
> > single mapping of field names -> numbers in the index, that would gain
> > us many of the above benefits?  Hmmm.
>
> You'll still have to be able to remap field numbers when adding
> entire indexes.

True, but it'd still be good progress for the frequent case of
adding/deleting docs to an existing index.  Progress not perfection...

> > Here's one idea I just had: assuming there are no deletions, you can
> > almost do a raw bytes copy from input segment to output (merged)
> > segment of the postings for a given term X.  I think for prox postings
> > you can.
>
> You can probably squeeze out some nice gains using a skipVint()
> function, even with deletions.

Good point.  I think likewise with copyVInt(int numToCopy).

> > But for freq postings, you can't, because they are delta coded.
>
> I'm working on this task right now for KS.
>
> KS implements the "Flexible Indexing" paradigm, so all posting data
> goes in a single file.
>
> I've applied an additional constraint to KS: Every binary file must
> consist of one type of record repeated over and over.  Every indexed
> field gets its own dedicated posting file with the suffix .pNNN to
> allow per-field posting formats.
>
> The I/O code is isolated in subclasses of a new class called
> "Stepper": You can turn any Stepper loose on its file and read it
> from top to tail.  When the file format changes, Steppers will get
> archived, like old plugins.
>
> My present task is to write the code for the Stepper subclasses
> MatchPosting, ScorePosting, and RichPosting.  (PayloadPosting can
> wait.)  As I write them, I will see if I can figure out a format that
> can be merged as speedily as possible.
> Perhaps the precise variant of delta encoding used in Lucene's .frq
> file should be avoided.

Neat!  Yes, designing the file format to accommodate "merging"
efficiently (plus searching of course) is a good idea since we lose so
much indexing time to this.

> > Except: it's only the first entry of the incoming segment's freq
> > postings that needs to be re-interpreted?  So you could read that one,
> > encode the delta based on "last docID" for previous segment (I think
> > we'd have to store this in index, probably only if termFreq >
> > threshold), and then copyBytes the rest of the posting?  I will try
> > this out on the merges I'm doing in LUCENE-843; I think it should
> > work and make merging faster (assuming no deletes)?
>
> Ugh, more special case code.
>
> I have to say, I started trying to go over your patch, and the
> overwhelming impression I got coming back to this part of the Lucene
> code base in earnest for the first time since using 1.4.3 as a
> porting reference was: simplicity seems to be nobody's priority these
> days.

Unfortunately this is just a tough tradeoff... higher performance code
is often not "simple".  I also still need to clean up the code, add
comments, etc, but even after that, it's not going to look "simple".
I think this is just the reality of performance optimization.

Mike
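
For illustration only, here is a rough sketch of the "re-encode the first
entry, then copy the rest as raw bytes" idea discussed above.  It assumes
a deliberately simplified posting format (a bare sequence of VInt doc
deltas, with the first delta relative to 0) and no deletions; it is not
Lucene's actual .frq layout, which also folds the term frequency into each
entry.  The class and method names (PostingMergeSketch, appendSegment,
docBase, lastDocWritten) are hypothetical, and the previous segment's last
docID is simply passed in rather than read from the index.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class PostingMergeSketch {
    // Write a non-negative int as a VInt: 7 bits per byte, high bit = "more follows".
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Read one VInt starting at pos[0] and advance pos[0] past it.
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Encode sorted doc IDs as VInt deltas (simplified format, assumption).
    static byte[] encode(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int id : docIds) {
            writeVInt(out, id - last);
            last = id;
        }
        return out.toByteArray();
    }

    // Decode every delta in the buffer back into absolute doc IDs.
    static int[] decode(byte[] buf) {
        int[] docs = new int[buf.length];
        int n = 0, last = 0;
        int[] pos = {0};
        while (pos[0] < buf.length) {
            last += readVInt(buf, pos);
            docs[n++] = last;
        }
        return Arrays.copyOf(docs, n);
    }

    // Append one segment's postings for a term to the merged output.
    // Only the first delta is decoded and rewritten; the remaining bytes are
    // copied untouched, since later deltas are relative to their predecessor.
    static void appendSegment(ByteArrayOutputStream out, byte[] seg,
                              int docBase, int lastDocWritten) {
        if (seg.length == 0) {
            return;
        }
        int[] pos = {0};
        int firstLocalDoc = readVInt(seg, pos);
        writeVInt(out, firstLocalDoc + docBase - lastDocWritten);
        out.write(seg, pos[0], seg.length - pos[0]);
    }

    public static void main(String[] args) {
        byte[] segA = encode(new int[] {2, 5, 130});   // first segment, 200 docs
        byte[] segB = encode(new int[] {0, 3, 300});   // second segment, local IDs
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        merged.write(segA, 0, segA.length);            // first segment copied as-is
        appendSegment(merged, segB, 200, 130);         // docBase = 200, last doc = 130
        System.out.println(Arrays.toString(decode(merged.toByteArray())));
        // prints [2, 5, 130, 200, 203, 500]
    }
}
```

With deletions the raw copy no longer works as written, because every
delta after a removed document would have to be recomputed.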