On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?
Yeah. KS has to decompress and serialize posting content, which sux.
The one saving grace is that with the Fibonacci merge schedule and
the seg-at-a-time indexing strategy, segments don't get merged nearly
as often as they do in Lucene.
Yeah we need to work on this one.
What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another. KS has
solved this problem for stored fields. Field defs are global and
field values are keyed by name rather than field number in the field
data file. Benefits:
* Whole documents can be read from one segment to
another as blobs.
* No flags byte.
* No remapping of field numbers.
* No conflict resolution at all.
* Compressed, uncompressed... doesn't matter.
* Less code.
* The possibility of allowing the user to provide their
own subclass for reading and writing fields. (For
Lucy, in the language of your choice.)
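
To illustrate the first benefit, here is a minimal sketch of copying
documents between segments as opaque blobs. This is not KS's actual
file format: the record layout (a length prefix, then name/value
pairs) and the helper names write_doc, copy_docs and read_docs are
assumptions made up for the example. The point is only that when
values are keyed by field name, the merge loop never has to decode a
record or remap field numbers.

```python
import io
import struct


def write_doc(out, fields):
    """Append one document to `out` as a length-prefixed blob.

    Hypothetical layout: each field is stored as (name, value), so the
    record carries no per-segment field numbers.
    """
    body = io.BytesIO()
    for name, value in fields.items():
        name_bytes = name.encode("utf-8")
        body.write(struct.pack(">H", len(name_bytes)))
        body.write(name_bytes)
        body.write(struct.pack(">I", len(value)))
        body.write(value)
    blob = body.getvalue()
    out.write(struct.pack(">I", len(blob)))
    out.write(blob)


def copy_docs(src, dst):
    """Copy every document record from `src` to `dst` without decoding it."""
    copied = 0
    while True:
        header = src.read(4)
        if len(header) < 4:
            return copied
        (length,) = struct.unpack(">I", header)
        dst.write(header)
        dst.write(src.read(length))  # the body is moved as raw bytes
        copied += 1


def read_docs(src):
    """Decode all documents, only to check that copied blobs are intact."""
    docs = []
    while True:
        header = src.read(4)
        if len(header) < 4:
            return docs
        (length,) = struct.unpack(">I", header)
        body = io.BytesIO(src.read(length))
        fields = {}
        while True:
            name_header = body.read(2)
            if len(name_header) < 2:
                break
            (name_len,) = struct.unpack(">H", name_header)
            name = body.read(name_len).decode("utf-8")
            (value_len,) = struct.unpack(">I", body.read(4))
            fields[name] = body.read(value_len)
        docs.append(fields)
```

Because copy_docs only looks at the length prefix, it makes no
difference whether a value is compressed or in what order a segment
wrote its fields.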
What I haven't got yet is a way to move terms and postings
economically from one segment to another. But I'm working on it. :)
One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size. With KS's
Fibonacci merge policy, you don't. LUCENE-854 has some more details.
However, even under Fibo, when you get socked with a big merge, you
really get socked. It bothers me that the time for adding to your
index can vary so unpredictably.
Segment merging really is costly. In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging! Details are in
LUCENE-856...
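
To make the cost of repeated merging concrete, here is a toy
simulation. It assumes a simplified level-based policy (merge whenever
the newest merge_factor segments are the same size), which is only a
rough stand-in for Lucene's real merge policy and is not KS's
Fibonacci schedule. The function name and its parameters are made up
for illustration. It counts how many documents get rewritten by
merges as segments are flushed.

```python
def simulate_level_merges(num_flushes, merge_factor=10, docs_per_flush=1):
    """Return (segment sizes, docs rewritten by merges, largest single merge)."""
    segments = []        # doc counts, oldest segment first
    docs_rewritten = 0   # total docs copied by all merges so far
    biggest_merge = 0    # size of the most expensive single merge
    for _ in range(num_flushes):
        segments.append(docs_per_flush)
        # Cascade: merge while the newest merge_factor segments are equal-sized.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
            docs_rewritten += merged
            biggest_merge = max(biggest_merge, merged)
    return segments, docs_rewritten, biggest_merge


if __name__ == "__main__":
    for flushes in (99, 100):
        print(flushes, simulate_level_merges(flushes))
```

In this toy model, 99 flushes leave 18 segments behind after 90
document rewrites, while the 100th flush sets off a cascade whose last
merge rewrites every document in the index at once. That is the kind
of spike that makes the time for a single add so uneven.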
This is a great model. Are there Python bindings to Lucy yet/coming?
I'm sure that they will appear once the C core is ready. The
approach I am taking is to make some high-level design decisions
collaboratively on lucy-dev, then implement them in KS. There's a
large amount of code that has been written according to our specs
that is working in KS and ready to commit to Lucy after trivial
changes. There's more that's ready for review. However, release of
KS 0.20 is taking priority, so code flow into the Lucy repository has
slowed.
I'll also be looking for a job in about a month. That may slow us
down some more, though it won't stop things -- I've basically
decided that I'll do what it takes to get Lucy off the ground. I'll go
with something stopgap if nothing materializes which is compatible
with that commitment.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/