[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take6.patch

Attached latest patch. I'm now working towards simplifying & cleaning up the
code & design: eliminated dead code left over from the previous iterations,
used the existing RAMFile instead of my own new class, refactored
duplicate/confusing code, added comments, etc. It's getting closer to a
committable state but still has a ways to go. I also renamed the class from
MultiDocumentWriter to DocumentsWriter.

To summarize the current design:

1. Write stored fields & term vectors to files in the Directory immediately
   (don't buffer these in RAM).

2. Write freq & prox postings to RAM directly as a byte stream instead of a
   first pass as int[] and then a second pass as a byte stream. This single
   pass instead of a double pass is a big savings. I use slices into shared
   byte[] arrays to efficiently allocate bytes to the postings that need
   them.

3. Build a Postings hash that holds the Postings for many documents at once
   instead of a single doc, keyed by unique term. Not tearing down &
   rebuilding the Postings hash with every doc saves a lot of time. Also,
   when term vectors are off this saves a quicksort for every doc, which
   gives a very good performance gain. When the Postings hash is full (has
   used up the allowed RAM) I then create a real Lucene segment when
   autoCommit=true, else a "partial segment".

4. Use my own "partial segment" format that differs from Lucene's normal
   segments in that it is optimized for merging (and unusable for
   searching). This format, and the merger I created to work with it,
   performs merging mostly by copying blocks of bytes instead of
   reinterpreting every vInt in each Postings list. These partial segments
   are only created when IndexWriter has autoCommit=false; on commit they
   are then merged into the real Lucene segment format.

5. Reuse the Posting, PostingVector, char[] and byte[] objects used by the
   Postings hash.
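The shared-array slice scheme in item 2 can be sketched roughly as below. This is my own illustration, not code from the patch: the class name, slice sizes, and 4-byte forward-pointer encoding are all invented for the example. The idea is that many per-term postings streams interleave inside one shared byte[], each stream growing as a chain of progressively larger slices, so no stream ever owns a private array.

```java
import java.util.Arrays;

/** Toy sketch of byte slices carved out of a shared byte[] (illustrative only). */
public class ByteSlicePool {
    // Slice sizes grow by level: short postings stay cheap, long ones
    // amortize the forward-pointer overhead with bigger slices.
    private static final int[] LEVEL_SIZE = {8, 16, 32, 64, 128};

    private byte[] buffer = new byte[1024];
    private int used = 0;

    /** Per-stream write state: where the stream starts and where it writes next. */
    public static class Stream {
        int start;    // offset of the first slice (needed to read back)
        int level;    // size level of the current slice
        int sliceEnd; // one past the current slice's last data byte
        int pos;      // next write position in the shared buffer
        int length;   // total bytes written to this stream
    }

    private int allocate(int size) {
        if (used + size > buffer.length)
            buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, used + size));
        int start = used;
        used += size;
        return start;
    }

    private void writeInt(int offset, int value) {
        for (int i = 0; i < 4; i++)
            buffer[offset + i] = (byte) (value >>> (8 * i));
    }

    private int readInt(int offset) {
        int v = 0;
        for (int i = 0; i < 4; i++)
            v |= (buffer[offset + i] & 0xFF) << (8 * i);
        return v;
    }

    public Stream newStream() {
        Stream s = new Stream();
        s.start = allocate(LEVEL_SIZE[0] + 4); // +4 reserves room for a forward pointer
        s.pos = s.start;
        s.sliceEnd = s.start + LEVEL_SIZE[0];
        return s;
    }

    /** Append one byte, chaining a larger slice when the current one fills up. */
    public void writeByte(Stream s, byte b) {
        if (s.pos == s.sliceEnd) {
            int nextLevel = Math.min(s.level + 1, LEVEL_SIZE.length - 1);
            int next = allocate(LEVEL_SIZE[nextLevel] + 4);
            writeInt(s.sliceEnd, next); // link old slice -> new slice
            s.level = nextLevel;
            s.pos = next;
            s.sliceEnd = next + LEVEL_SIZE[nextLevel];
        }
        buffer[s.pos++] = b;
        s.length++;
    }

    /** Reassemble a stream's bytes by walking its slice chain. */
    public byte[] toBytes(Stream s) {
        byte[] out = new byte[s.length];
        int slice = s.start, level = 0, copied = 0;
        while (copied < s.length) {
            int n = Math.min(LEVEL_SIZE[level], s.length - copied);
            System.arraycopy(buffer, slice, out, copied, n);
            copied += n;
            slice = readInt(slice + LEVEL_SIZE[level]); // follow forward pointer
            level = Math.min(level + 1, LEVEL_SIZE.length - 1);
        }
        return out;
    }
}
```

Lucene's real implementation differs in the details (e.g. the actual patch interleaves freq and prox streams and encodes vInts into the slices), but the allocation pattern is the same: two streams written alternately end up with their slices interleaved in one buffer, and each is read back by following its own chain.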
I plan to keep simplifying the design & implementation. Specifically, I'm
going to test removing #4 above entirely (using my own "partial segment"
format that's optimized for merging, not searching). While doing this may
give back some of the performance gains, that code is the source of much
added complexity in the patch, and it duplicates the current SegmentMerger
code. It was more necessary before (when we would merge thousands of
single-doc segments in memory), but now that each segment contains many
docs I think we are no longer gaining as much performance from it.

I plan instead to write all segments in the "real" Lucene segment format
and use the current SegmentMerger, possibly with some small changes, to do
the merges even when autoCommit=false. Since we have another issue
(LUCENE-856) to optimize segment merging, I can carry over any
optimizations that we may want to keep into that issue. If this doesn't
lose much performance it will make the approach here even simpler.

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
>                      LUCENE-843.take3.patch, LUCENE-843.take4.patch,
>                      LUCENE-843.take5.patch, LUCENE-843.take6.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
>   use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
>   in-RAM merges. Once RAM is full, flush buffers to disk (and
>   merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]