"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Merging is costly because you read all data in and then write all
> > data out, so you want to minimize, for each byte of data in the
> > index, how many times it will be "serviced" (read in, written out)
> > as part of a merge.
>
> Avoiding the re-writing of stored fields might be nice:
> http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html
That's exactly the approach I'm taking in LUCENE-843: stored fields and term vectors are immediately written to disk, so only the frq, prx and tis files use up memory. This greatly extends how many docs you can buffer before having to flush (assuming your docs have stored fields and term vectors).

When memory is full, I either flush a segment to disk (when the writer is in autoCommit=true mode), or I flush the data to tmp files, which are finally merged into a segment when the writer is closed. That final merge is less costly because the bytes in/out are just frq, prx and tis, so autoCommit=false mode performs better than autoCommit=true mode.

But this only applies to the segment created from buffered docs (ie the segment created by a "flush"). Subsequent merges must still copy all bytes in/out; in LUCENE-843 I haven't changed anything about how segments are merged.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
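To make the buffered-docs flush behavior above concrete, here is a toy sketch of the bookkeeping. All names (FlushSketch, BUDGET, the fake segment/tmp-file names) are illustrative, not the actual LUCENE-843 classes: the point is only that stored fields and term vectors never count against the RAM budget, and that autoCommit=false defers the (cheap, postings-only) merge to close().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the LUCENE-843 flush policy described in the
// mail; real class and file names differ.
class FlushSketch {
    static final long BUDGET = 1024;     // pretend RAM budget, in bytes
    long postingsBytes = 0;              // only frq + prx + tis buffer RAM
    final boolean autoCommit;
    final List<String> segments = new ArrayList<>(); // finished segments
    final List<String> tmpFiles = new ArrayList<>(); // merged at close()

    FlushSketch(boolean autoCommit) { this.autoCommit = autoCommit; }

    void addDocument(long postingsCost) {
        // Stored fields and term vectors would be streamed straight to
        // disk here, so they never enter postingsBytes.
        postingsBytes += postingsCost;
        if (postingsBytes >= BUDGET) flush();
    }

    void flush() {
        if (postingsBytes == 0) return;
        if (autoCommit) {
            segments.add("segment_" + segments.size()); // visible segment
        } else {
            tmpFiles.add("tmp_" + tmpFiles.size());     // deferred, cheaper
        }
        postingsBytes = 0;
    }

    void close() {
        flush();
        if (!autoCommit && !tmpFiles.isEmpty()) {
            // One final merge whose bytes in/out are just frq/prx/tis.
            segments.add("segment_" + segments.size());
            tmpFiles.clear();
        }
    }
}
```

With autoCommit=false, nothing is visible until close(), at which point the tmp files are merged into a single segment; with autoCommit=true, each flush produces a visible segment immediately.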