Re: improve how IndexWriter uses RAM to buffer added documents

Michael McCandless Thu, 05 Apr 2007 17:27:10 -0700

"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
> 
> >>> (I think for KS you "add" a previous segment not that
> >>> differently from how you "add" a document)?
> >>
> >> Yeah.  KS has to decompress and serialize posting content, which sux.
> >>
> >> The one saving grace is that with the Fibonacci merge schedule and
> >> the seg-at-a-time indexing strategy, segments don't get merged nearly
> >> as often as they do in Lucene.
> >
> > Yeah we need to work on this one.
> 
> What we need to do is cut down on decompression and conflict  
> resolution costs when reading from one segment to another.  KS has  
> solved this problem for stored fields.  Field defs are global and  
> field values are keyed by name rather than field number in the field  
> data file.  Benefits:
> 
>    * Whole documents can be read from one segment to
>      another as blobs.
>    * No flags byte.
>    * No remapping of field numbers.
>    * No conflict resolution at all.
>    * Compressed, uncompressed... doesn't matter.
>    * Less code.
>    * The possibility of allowing the user to provide their
>      own subclass for reading and writing fields. (For
>      Lucy, in the language of your choice.)


I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.

I think the ability to suddenly birth a new field, or change a field's
attributes like "has vectors", "stores norms", etc., with a new
document, is something we just can't break at this point with Lucene?

If we could get those benefits without breaking backwards
compatibility then that would be awesome.  I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits?  Hmmm.
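
A minimal sketch of what such a single, index-wide name -> number mapping
might look like, in Python for brevity (Lucene itself is Java).
GlobalFieldNumbers and its methods are hypothetical names for illustration,
not Lucene APIs. The point is only that if every segment draws its field
numbers from one shared table, a merge never has to remap them:

```python
class GlobalFieldNumbers:
    """Hypothetical index-wide table: each field name gets one number, forever."""

    def __init__(self):
        self._by_name = {}   # field name -> number
        self._names = []     # number -> field name

    def number_for(self, name):
        # Assign the next free number the first time a name is seen;
        # afterwards always return the same number.
        num = self._by_name.get(name)
        if num is None:
            num = len(self._names)
            self._by_name[name] = num
            self._names.append(name)
        return num

    def name_for(self, number):
        return self._names[number]


fields = GlobalFieldNumbers()

# Two "segments" that see the fields in a different order still agree
# on the numbering, so stored values could be copied without remapping.
seg1 = [fields.number_for(n) for n in ("title", "body")]
seg2 = [fields.number_for(n) for n in ("body", "date", "title")]
```

A new field can still be born at any time (it just gets the next number).
Changing per-field attributes from document to document is a separate
problem that this sketch does not address.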

> What I haven't got yet is a way to move terms and postings  
> economically from one segment to another.  But I'm working on it.  :)

Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X.  I think for prox postings
you can.  But for freq postings, you can't, because they are delta
coded.

Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted?  So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting?  I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?
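
A rough sketch of that idea, in Python rather than Java to keep it short.
It assumes a simplified freq format, which is not Lucene's actual on-disk
layout: each posting is VInt(docDelta) followed by VInt(freq), and the first
docDelta of a segment is the docID itself. It also assumes no deletions, and
that each segment stores the last docID for the term, as suggested above.
All function names are made up for illustration:

```python
def write_vint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_vint(buf, pos):
    """Decode one VInt at pos; return (value, position after it)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return value, pos
        shift += 7


def encode_postings(postings):
    """postings: list of (docID, freq) sorted by docID."""
    out, last = bytearray(), 0
    for doc, freq in postings:
        out += write_vint(doc - last)
        out += write_vint(freq)
        last = doc
    return bytes(out)


def decode_postings(buf):
    postings, pos, doc = [], 0, 0
    while pos < len(buf):
        delta, pos = read_vint(buf, pos)
        freq, pos = read_vint(buf, pos)
        doc += delta
        postings.append((doc, freq))
    return postings


def merge_term(segments):
    """segments: list of (freq_bytes, last_doc_in_segment, segment_doc_count).

    Only the first docDelta of each segment is re-encoded, relative to the
    last docID already written; everything after it is copied as raw bytes.
    """
    merged = bytearray()
    doc_base = 0   # docID offset of the current segment in the merged index
    last_doc = 0   # last merged docID written for this term
    for freq_bytes, seg_last_doc, seg_doc_count in segments:
        if freq_bytes:
            first_doc, pos = read_vint(freq_bytes, 0)
            merged += write_vint(doc_base + first_doc - last_doc)
            merged += freq_bytes[pos:]          # raw copy of the rest
            last_doc = doc_base + seg_last_doc
        doc_base += seg_doc_count
    return bytes(merged)


seg_a = encode_postings([(0, 2), (3, 1), (7, 5)])   # segment with 10 docs
seg_b = encode_postings([(1, 1), (200, 3)])         # segment with 300 docs
merged = merge_term([(seg_a, 7, 10), (seg_b, 200, 300)])
```

Here decode_postings(merged) gives the docIDs of seg_b shifted by 10, and
the merged bytes are the same as encoding the combined list from scratch.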

> > One thing that irks me about the
> > current Lucene merge policy (besides that it gets confused when you
> > flush-by-RAM-usage) is that it's a "pay it forward" design so you're
> > always over-paying when you build a given index size.  With KS's
> > Fibonacci merge policy, you don't.  LUCENE-854 has some more details.
> 
> However, even under Fibo, when you get socked with a big merge, you  
> really get socked.  It bothers me that the time for adding to your  
> index can vary so unpredictably.

Yeah, I think that's best solved by concurrency (either with threads
or with our own "scheduling" eg on adding a doc you go and merge
another N terms in the running merge)?  There have been several
proposals recently for making Lucene's merging concurrent
(backgrounded), as part of LUCENE-847.
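
To make the non-threaded "scheduling" variant concrete, here is a toy
sketch in Python: the merge is a generator that handles one term per step,
and every add_document call advances it by at most N terms. Writer,
merge_terms and terms_per_add are hypothetical names, segments are plain
dicts of term -> docID list, and the second segment's docIDs are assumed
to be rebased already. Nothing here is taken from LUCENE-847:

```python
from itertools import islice


def merge_terms(seg_a, seg_b, out):
    """Merge two term -> docID-list dicts into out, one term per step."""
    for term in sorted(set(seg_a) | set(seg_b)):
        out[term] = seg_a.get(term, []) + seg_b.get(term, [])
        yield term


class Writer:
    def __init__(self, terms_per_add):
        self.terms_per_add = terms_per_add
        self.buffered = []
        self.merged = {}
        self.running_merge = None

    def start_merge(self, seg_a, seg_b):
        self.merged = {}
        self.running_merge = merge_terms(seg_a, seg_b, self.merged)

    def add_document(self, doc):
        self.buffered.append(doc)
        if self.running_merge is not None:
            # Pay a small, bounded slice of the merge cost on each add.
            done = list(islice(self.running_merge, self.terms_per_add))
            if len(done) < self.terms_per_add:
                self.running_merge = None   # merge has finished


w = Writer(terms_per_add=2)
w.start_merge({"a": [0], "b": [1], "c": [2]}, {"b": [5], "d": [6], "e": [7]})
for doc in ("doc1", "doc2", "doc3"):
    w.add_document(doc)
```

The cost of each add is then bounded by N terms' worth of merging instead
of occasionally absorbing a whole merge, at the price of the merged segment
becoming available later.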

> > Segment merging really is costly.  In building a large (86 GB, 10 MM
> > docs) index, 65.6% of the time was spent merging!  Details are in
> > LUCENE-856...
> 
> > This is a great model.  Are there Python bindings to Lucy yet/coming?
> 
> I'm sure that they will appear once the C core is ready.  The  
> approach I am taking is to make some high-level design decisions  
> collaboratively on lucy-dev, then implement them in KS.  There's a  
> large amount of code that has been written according to our specs  
> that is working in KS and ready to commit to Lucy after trivial  
> changes.  There's more that's ready for review.  However, release of  
> KS 0.20 is taking priority, so code flow into the Lucy repository has  
> slowed.

OK, good to hear.

> I'll also be looking for a job in about a month.  That may slow us  
> down some more, though it won't stop things --  I've basically  
> decided that I'll do what it takes to get Lucy off the ground.  I'll go  
> with something stopgap if nothing materializes which is compatible  
> with that commitment.

Whoa, I'm sorry to hear that :(  I hope you land, quickly, somewhere
that takes Lucy/KS seriously.  It's clearly excellent work.

Mike

