"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On 3/22/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> > Merging is costly because you read all data in and then write all
> > data out, so you want to minimize, for each byte of data in the
> > index, how many times it will be "serviced" (read in, written out)
> > as part of a merge.
>
> Avoiding the re-writing of stored fields might be nice:
> http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html
That's exactly the approach I'm taking in LUCENE-843: stored fields and term vectors are immediately written to disk, so only the frq, prx and tis files use up memory. This greatly extends how many docs you can buffer before having to flush (assuming your docs have stored fields and term vectors).

When memory is full, I either flush a segment to disk (when the writer is in autoCommit=true mode), or I flush the data to tmp files, which are finally merged into a segment when the writer is closed. That final merge is less costly because the bytes in/out are just frq, prx and tis, so autoCommit=false mode performs better than autoCommit=true mode.

But this only applies to the segment created from buffered docs (ie the segment created by a "flush"). Subsequent merges must still copy all bytes in/out; in LUCENE-843 I haven't changed anything about how segments are merged.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
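To make the buffered-docs flush behavior above concrete, here is a toy sketch of the bookkeeping. All names (FlushSketch, BUDGET, the fake segment/tmp-file names) are illustrative, not the actual LUCENE-843 classes: the point is only that stored fields and term vectors never count against the RAM budget, and that autoCommit=false defers the (cheap, postings-only) merge to close().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the LUCENE-843 flush policy described in the
// mail; real class and file names differ.
class FlushSketch {
    static final long BUDGET = 1024;     // pretend RAM budget, in bytes
    long postingsBytes = 0;              // only frq + prx + tis buffer RAM
    final boolean autoCommit;
    final List<String> segments = new ArrayList<>(); // finished segments
    final List<String> tmpFiles = new ArrayList<>(); // merged at close()

    FlushSketch(boolean autoCommit) { this.autoCommit = autoCommit; }

    void addDocument(long postingsCost) {
        // Stored fields and term vectors would be streamed straight to
        // disk here, so they never enter postingsBytes.
        postingsBytes += postingsCost;
        if (postingsBytes >= BUDGET) flush();
    }

    void flush() {
        if (postingsBytes == 0) return;
        if (autoCommit) {
            segments.add("segment_" + segments.size()); // visible segment
        } else {
            tmpFiles.add("tmp_" + tmpFiles.size());     // deferred, cheaper
        }
        postingsBytes = 0;
    }

    void close() {
        flush();
        if (!autoCommit && !tmpFiles.isEmpty()) {
            // One final merge whose bytes in/out are just frq/prx/tis.
            segments.add("segment_" + segments.size());
            tmpFiles.clear();
        }
    }
}
```

With autoCommit=false, nothing is visible until close(), at which point the tmp files are merged into a single segment; with autoCommit=true, each flush produces a visible segment immediately.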