Re: improve how IndexWriter uses RAM to buffer added documents

Marvin Humphrey Thu, 05 Apr 2007 13:50:20 -0700


On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:

(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?


Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and
the seg-at-a-time indexing strategy, segments don't get merged nearly
as often as they do in Lucene.


Yeah we need to work on this one.

What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another.  KS has
solved this problem for stored fields.  Field defs are global and
field values are keyed by name rather than field number in the field
data file.  Benefits:


  * Whole documents can be read from one segment to
    another as blobs.
  * No flags byte.
  * No remapping of field numbers.
  * No conflict resolution at all.
  * Compressed, uncompressed... doesn't matter.
  * Less code.
  * The possibility of allowing the user to provide their
    own subclass for reading and writing fields. (For
    Lucy, in the language of your choice.)
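
To make the contrast concrete, here is a rough sketch in Python.  It is
not KS or Lucene code: the JSON encoding, the segment layout, and the
names merge_numbered and merge_named are all assumptions made up for
illustration.  It only shows why per-segment field numbers force a
decode-and-remap pass on every document, while name-keyed documents can
be copied across as opaque blobs.

```python
import json


def merge_numbered(segments):
    """Merge segments whose docs key values by per-segment field number.

    Hypothetical layout: each segment is (field_names, docs), where
    field_names[i] is the name behind field number i in that segment only.
    Numbers can clash between segments, so every doc is decoded and remapped.
    """
    global_numbers = {}  # field name -> field number in the merged segment
    merged_docs = []
    for field_names, docs in segments:
        for blob in docs:
            doc = json.loads(blob)  # decode the stored document
            remapped = {}
            for local_num, value in doc.items():
                name = field_names[int(local_num)]
                new_num = global_numbers.setdefault(name, len(global_numbers))
                remapped[str(new_num)] = value
            merged_docs.append(json.dumps(remapped, sort_keys=True).encode())
    merged_names = sorted(global_numbers, key=global_numbers.get)
    return merged_names, merged_docs


def merge_named(segments):
    """Merge segments whose docs key values by field name.

    Field definitions are assumed to be global, so the blobs are never
    decoded: they are appended to the merged segment as-is.
    """
    merged_docs = []
    for docs in segments:
        merged_docs.extend(docs)  # opaque copy, compressed or not
    return merged_docs
```

With two segments that number "title" and "body" in opposite orders,
merge_numbered has to rewrite every document from the second segment,
whereas merge_named hands back the very same byte strings it was given.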

What I haven't got yet is a way to move terms and postings
economically from one segment to another.  But I'm working on it.  :)

One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.

However, even under Fibo, when you get socked with a big merge, you
really get socked.  It bothers me that the time for adding to your
index can vary so unpredictably.

Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...

This is a great model.  Are there Python bindings to Lucy yet/coming?

I'm sure that they will appear once the C core is ready.  The
approach I am taking is to make some high-level design decisions
collaboratively on lucy-dev, then implement them in KS.  There's a
large amount of code that has been written according to our specs
that is working in KS and ready to commit to Lucy after trivial
changes.  There's more that's ready for review.  However, release of
KS 0.20 is taking priority, so code flow into the Lucy repository has
slowed.

I'll also be looking for a job in about a month.  That may slow us
down some more, though it won't stop things -- I've basically
decided that I'll do what it takes to get Lucy off the ground.  I'll go
with something stopgap if nothing materializes which is compatible
with that commitment.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


