[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take5.patch

I attached a new iteration of the patch. It's quite different from the
last patch.

After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs. This is in contrast to the previous approach, where each
doc made its own segment and the segments were then merged. It turns
out this is even faster than my previous approach, especially for
smaller docs and especially when term vectors are off (because no
quicksort() is needed until the segment is flushed). I will attach new
benchmark results.

Other changes:

  * Changed my benchmarking tool / testing (IndexLineFiles):

    - I turned off the compound file format (to reduce time NOT spent
      on indexing).

    - I noticed I was not downcasing the terms, so I fixed that.

    - I now do my own line processing instead of calling
      BufferedReader.readLine, to reduce GC cost (again, to reduce time
      NOT spent on indexing); there is a sketch of what I mean below.

  * Norms now properly flush to disk in the autoCommit=false case.

  * All unit tests pass except the disk-full tests.

  * I turned on asserts for the unit tests (jvm arg -ea added to the
    junit ant task). I think we should use asserts when running tests;
    I have quite a few asserts now.

With this new approach, as I process each term in the document I
immediately write the prox/freq in their compact (VInt) format into
shared byte[] buffers, rather than accumulating int[] arrays that then
need to be re-processed into the VInt encoding. This speeds things up
because we don't double-process the postings, and it uses less
per-document RAM because the intermediate postings are stored as VInts
rather than as ints.

When enough RAM is used by the Posting entries plus the byte[] buffers,
I flush them to a partial RAM segment. When enough of these RAM
segments have accumulated, I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false),
which are then merged at the end to create a real Lucene segment.
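To illustrate that line-processing change: this is only a rough sketch
(not code from the patch; the class name and buffer size are made up)
of reading lines into reused buffers, so the benchmark does not
allocate a new String per line the way BufferedReader.readLine does:

    import java.io.IOException;
    import java.io.Reader;

    // Sketch only: reuse one char[] and one StringBuilder across all
    // lines instead of allocating a String per line.
    public class ReusableLineReader {
      private final Reader in;
      private final char[] buffer = new char[64*1024];
      private final StringBuilder line = new StringBuilder();
      private int pos, limit;

      public ReusableLineReader(Reader in) {
        this.in = in;
      }

      // Returns the next line in a reused StringBuilder, or null at EOF.
      public StringBuilder readLine() throws IOException {
        line.setLength(0);
        while (true) {
          if (pos >= limit) {
            limit = in.read(buffer);
            pos = 0;
            if (limit <= 0)
              return line.length() > 0 ? line : null;
          }
          while (pos < limit) {
            char c = buffer[pos++];
            if (c == '\n')
              return line;
            if (c != '\r')
              line.append(c);
          }
        }
      }
    }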
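Here is a minimal sketch of the "write postings as VInts directly into
a shared byte[] pool" idea described above. It is not the patch's
actual class (the names and the doubling growth policy are simplified
for illustration), but the encoding matches Lucene's VInt format:

    // Sketch only: a growable shared byte pool that postings are
    // appended to in their final compact VInt form as each term
    // occurrence is processed.
    class BytePool {
      byte[] bytes = new byte[4096];
      int upto;                          // next free position in the pool

      // Append one value in VInt format (7 bits per byte, high bit
      // means "more bytes follow"), the same layout writeVInt uses.
      void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
          writeByte((byte) ((i & 0x7F) | 0x80));
          i >>>= 7;
        }
        writeByte((byte) i);
      }

      void writeByte(byte b) {
        if (upto == bytes.length) {      // grow the shared buffer as needed
          byte[] newBytes = new byte[2*bytes.length];
          System.arraycopy(bytes, 0, newBytes, 0, bytes.length);
          bytes = newBytes;
        }
        bytes[upto++] = b;
      }
    }

So for each occurrence of a term, something like
pool.writeVInt(position - lastPosition) appends the prox delta in its
final form, and nothing has to be re-encoded when the segment is
flushed.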
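And a rough sketch of the two-level flush described in the last two
paragraphs. All of the names and thresholds here are assumptions for
illustration, not the patch's actual fields or methods:

    // Sketch only: flush buffered postings to a partial RAM segment
    // when the RAM budget is hit, and write those partial segments
    // out once enough of them have accumulated.
    class FlushPolicySketch {
      long ramBufferSize = 16*1024*1024; // RAM budget for buffered postings
      int maxRAMSegments = 10;           // partial RAM segments before writing out
      boolean autoCommit = true;

      long postingsBytesUsed;            // Posting entries + shared byte[] pools
      int numRAMSegments;

      void maybeFlush() {
        if (postingsBytesUsed >= ramBufferSize) {
          // Sort the postings hash and write a partial segment kept in RAM.
          flushPartialRAMSegment();
          numRAMSegments++;
          postingsBytesUsed = 0;
        }
        if (numRAMSegments >= maxRAMSegments) {
          if (autoCommit) {
            // Merge the partial RAM segments into a real Lucene segment.
            flushRealSegment();
          } else {
            // Write partial segments to disk; they are merged into a
            // real segment only at the end.
            flushPartialSegmentsToDisk();
          }
          numRAMSegments = 0;
        }
      }

      void flushPartialRAMSegment() { /* omitted */ }
      void flushRealSegment() { /* omitted */ }
      void flushPartialSegmentsToDisk() { /* omitted */ }
    }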
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments
> are merged.
>
> The basic ideas are:
>
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>
>   * Recycle objects/buffers to reduce time/stress in GC.
>
>   * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format, nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method,
> "setRAMBufferSize", in IndexWriter (setMaxBufferedDocs is
> deprecated), so that flushing happens according to RAM usage and not
> after a fixed number of added documents.
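For reference, a minimal usage sketch of that externally visible
change. The method name (setRAMBufferSize) is taken from the
description above; the byte units, the 32 MB figure and the rest of
the setup are assumptions for illustration and may not match the
final patch:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class RAMBufferUsageSketch {
      public static void main(String[] args) throws Exception {
        // Open a writer as in Lucene 2.2; the index path is a placeholder.
        IndexWriter writer = new IndexWriter("/tmp/testindex",
                                             new StandardAnalyzer(), true);

        // Flush by RAM usage rather than a fixed document count
        // (setMaxBufferedDocs is deprecated by the patch).
        // Assumed units: bytes.
        writer.setRAMBufferSize(32*1024*1024);

        // ... addDocument() calls here flush whenever the buffered
        // postings exceed the RAM budget ...

        writer.close();
      }
    }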