"Marvin Humphrey" <[EMAIL PROTECTED]> wrote: > On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote: > > >>> (I think for KS you "add" a previous segment not that > >>> differently from how you "add" a document)? > >> > >> Yeah. KS has to decompress and serialize posting content, which sux. > >> > >> The one saving grace is that with the Fibonacci merge schedule and > >> the seg-at-a-time indexing strategy, segments don't get merged nearly > >> as often as they do in Lucene. > > > > Yeah we need to work on this one. > > What we need to do is cut down on decompression and conflict > resolution costs when reading from one segment to another. KS has > solved this problem for stored fields. Field defs are global and > field values are keyed by name rather than field number in the field > data file. Benefits: > > * Whole documents can be read from one segment to > another as blobs. > * No flags byte. > * No remapping of field numbers. > * No conflict resolution at all. > * Compressed, uncompressed... doesn't matter. > * Less code. > * The possibility of allowing the user to provide their > own subclass for reading and writing fields. (For > Lucy, in the language of your choice.)

I hear you, and I really really love those benefits, but we just
don't have this freedom with Lucene. I think the ability to suddenly
birth a new field, or change a field's attributes like "has vectors",
"stores norms", etc., with a new document, is something we just can't
break at this point with Lucene?

If we could get those benefits without breaking backwards
compatibility then that would be awesome. I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits? Hmmm.

> What I haven't got yet is a way to move terms and postings
> economically from one segment to another. But I'm working on it. :)

Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X. I think for prox postings
you can. But for freq postings, you can't, because they are delta
coded.

Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted? So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting? I will try
this out on the merges I'm doing in LUCENE-843; I think it should work
and make merging faster (assuming no deletes)?

> > One thing that irks me about the
> > current Lucene merge policy (besides that it gets confused when you
> > flush-by-RAM-usage) is that it's a "pay it forward" design so you're
> > always over-paying when you build a given index size. With KS's
> > Fibonacci merge policy, you don't. LUCENE-854 has some more details.
>
> However, even under Fibo, when you get socked with a big merge, you
> really get socked. It bothers me that the time for adding to your
> index can vary so unpredictably.

Yeah, I think that's best solved by concurrency (either with threads
or with our own "scheduling", e.g. on adding a doc you go and merge
another N terms in the running merge)? There have been several
proposals recently for making Lucene's merging concurrent
(backgrounded), as part of LUCENE-847.

> > Segment merging really is costly. In building a large (86 GB, 10 MM
> > docs) index, 65.6% of the time was spent merging! Details are in
> > LUCENE-856...
>
> > This is a great model. Are there Python bindings to Lucy yet/coming?
>
> I'm sure that they will appear once the C core is ready. The
> approach I am taking is to make some high-level design decisions
> collaboratively on lucy-dev, then implement them in KS. There's a
> large amount of code that has been written according to our specs
> that is working in KS and ready to commit to Lucy after trivial
> changes. There's more that's ready for review. However, release of
> KS 0.20 is taking priority, so code flow into the Lucy repository has
> slowed.

OK, good to hear.

> I'll also be looking for a job in about a month. That may slow us
> down some more, though it won't stop things -- I've basically
> decided that I'll do what it takes to get Lucy off the ground. I'll go
> with something stopgap if nothing materializes which is compatible
> with that commitment.

Whoa, I'm sorry to hear that :( I hope you land, quickly, somewhere
that takes Lucy/KS seriously. It's clearly excellent work.

Mike
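
A rough sketch of the first-entry re-encoding idea discussed above, in
Python for brevity. It is illustrative only and rests on several
assumptions: postings are modeled as nothing but VInt-coded docID
deltas (Lucene's real freq file also carries term frequencies), there
are no deletions, each segment has at least one posting for the term,
and the "last docID" of the already-merged postings is known to the
caller. All function names here are made up for the example; none of
them are Lucene or KS APIs.

```python
def write_vint(value):
    """Encode a non-negative int as a VInt: 7 bits per byte, low-order first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(buf, pos=0):
    """Decode one VInt starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not (b & 0x80):
            return value, pos
        shift += 7


def encode_postings(doc_ids):
    """Delta-code a sorted docID list; the first delta is relative to 0."""
    out = bytearray()
    last = 0
    for doc in doc_ids:
        out += write_vint(doc - last)
        last = doc
    return bytes(out)


def decode_postings(buf):
    """Inverse of encode_postings (used here only to check the result)."""
    docs, pos, last = [], 0, 0
    while pos < len(buf):
        delta, pos = read_vint(buf, pos)
        last += delta
        docs.append(last)
    return docs


def append_postings(merged, last_doc_merged, seg_bytes, doc_base):
    """Append one segment's postings for a term to the merged postings.

    Only the first entry is decoded and re-encoded against the last docID
    already written; every later delta is unchanged, so the remaining
    bytes are copied as-is. doc_base is the docID offset of this segment
    in the merged index (hypothetical parameter for this sketch).
    """
    first_doc, rest = read_vint(seg_bytes)
    merged += write_vint(doc_base + first_doc - last_doc_merged)
    merged += seg_bytes[rest:]  # raw byte copy of the rest of the posting


# Segment A holds 8 docs (docIDs 0..7); segment B's docIDs are local to B.
seg_a = [0, 3, 7]
seg_b = [1, 2, 300]
merged = bytearray(encode_postings(seg_a))
append_postings(merged, last_doc_merged=7,
                seg_bytes=encode_postings(seg_b), doc_base=8)
print(decode_postings(bytes(merged)))
```

The only per-term state the merge needs beyond the raw bytes is the
last docID of what has been written so far, which is why that value
would have to be stored in (or cheaply derivable from) the index.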