On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:

(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?

Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and
the seg-at-a-time indexing strategy, segments don't get merged nearly
as often as they do in Lucene.

Yeah we need to work on this one.

What we need to do is cut down on decompression and conflict resolution costs when reading from one segment to another. KS has solved this problem for stored fields. Field defs are global and field values are keyed by name rather than field number in the field data file. Benefits:

  * Whole documents can be read from one segment to
    another as blobs.
  * No flags byte.
  * No remapping of field numbers.
  * No conflict resolution at all.
  * Compressed, uncompressed... doesn't matter.
  * Less code.
  * The possibility of allowing the user to provide their
    own subclass for reading and writing fields. (For
    Lucy, in the language of your choice.)
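
To make the blob-copy point concrete, here's a toy Python sketch. It is not KS's actual file format: the byte layout and the names encode_doc, decode_doc, write_segment, iter_blobs and merge_segments are all invented for illustration. The only idea taken from the list above is that each stored document carries its own field names, so a merge can copy it without decoding it or remapping anything.

```python
import struct

def encode_doc(fields):
    """Pack {field_name: bytes_value} into one blob, keyed by name, not number.

    Hypothetical layout: field count, then (name, value) pairs, each
    prefixed with a 4-byte big-endian length.
    """
    parts = [struct.pack(">I", len(fields))]
    for name, value in fields.items():
        raw_name = name.encode("utf-8")
        parts.append(struct.pack(">I", len(raw_name)) + raw_name)
        parts.append(struct.pack(">I", len(value)) + value)
    return b"".join(parts)

def decode_doc(blob):
    """Inverse of encode_doc. Needed when fetching a doc, never during a merge."""
    (count,) = struct.unpack_from(">I", blob, 0)
    pos = 4
    fields = {}
    for _ in range(count):
        pair = []
        for _ in range(2):  # first the name, then the value
            (size,) = struct.unpack_from(">I", blob, pos)
            pos += 4
            pair.append(blob[pos:pos + size])
            pos += size
        fields[pair[0].decode("utf-8")] = pair[1]
    return fields

def write_segment(docs):
    """A toy segment: length-prefixed document blobs, back to back."""
    blobs = (encode_doc(doc) for doc in docs)
    return b"".join(struct.pack(">I", len(b)) + b for b in blobs)

def iter_blobs(segment):
    """Yield each document's raw bytes without looking inside them."""
    pos = 0
    while pos < len(segment):
        (size,) = struct.unpack_from(">I", segment, pos)
        pos += 4
        yield segment[pos:pos + size]
        pos += size

def merge_segments(segments):
    """Copy every document blob into one new segment, byte for byte.

    No field-number remapping, no flags, no decompression: the values
    are opaque bytes, so it doesn't matter whether they are compressed.
    """
    out = []
    for segment in segments:
        for blob in iter_blobs(segment):
            out.append(struct.pack(">I", len(blob)) + blob)
    return b"".join(out)
```

Two segments whose documents use different sets of fields merge cleanly here, since neither segment has a field-number table that could conflict with the other's.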

What I haven't got yet is a way to move terms and postings economically from one segment to another. But I'm working on it. :)

One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.

However, even under Fibo, when you get socked with a big merge, you really get socked. It bothers me that the time for adding to your index can vary so unpredictably.

Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...

This is a great model.  Are there Python bindings to Lucy yet/coming?

I'm sure that they will appear once the C core is ready. The approach I am taking is to make some high-level design decisions collaboratively on lucy-dev, then implement them in KS. There's a large amount of code that has been written according to our specs that is working in KS and ready to commit to Lucy after trivial changes. There's more that's ready for review. However, release of KS 0.20 is taking priority, so code flow into the Lucy repository has slowed.

I'll also be looking for a job in about a month. That may slow us down some more, though it won't stop things -- I've basically decided that I'll do what it takes to get Lucy off the ground. I'll go with something stopgap if nothing materializes which is compatible with that commitment.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


