[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless (JIRA) Tue, 03 Apr 2007 05:11:56 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486332
 ]


Michael McCandless commented on LUCENE-843:
-------------------------------------------

A couple more details on the testing: I run with java -server to get
the JVM's full set of optimizations, and the IO system is a local OS X
RAID 0 of 4 SATA drives.

Using the above tool I ran an initial set of benchmarks comparing old
(= Lucene trunk) vs new (= this patch), varying document size (~550
bytes to ~5,500 bytes to ~55,000 bytes of plain text from Europarl
"en").

For each document size I run 4 combinations: term vectors and stored
fields on or off, crossed with autoCommit true or false.  I measure
net docs/sec (= total # docs indexed divided by total time taken), RAM
efficiency (= avg # docs flushed with each flush divided by RAM buffer
size in MB), and avg HEAP RAM usage before each flush.
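
To make the three measurements concrete, here is a minimal sketch of
the arithmetic behind them.  The class and method names are
hypothetical (they are not part of the benchmark tool or of Lucene);
the formulas are the definitions given above, plus the usual relative
difference for the "% faster / more / less" figures.

```java
import java.util.Locale;

// Hypothetical helper; not part of Lucene or of the benchmark tool.
final class IndexingMetrics {

    // Net throughput: total docs indexed divided by total seconds taken.
    static double docsPerSec(long totalDocs, double totalSecs) {
        return totalDocs / totalSecs;
    }

    // RAM efficiency: avg docs flushed per flush divided by RAM buffer size (MB).
    static double docsPerMB(double avgDocsPerFlush, double ramBufferMB) {
        return avgDocsPerFlush / ramBufferMB;
    }

    // Relative gain of new over old, in percent, when larger is better.
    static double percentMore(double oldValue, double newValue) {
        return (newValue - oldValue) / oldValue * 100.0;
    }

    // Relative reduction of new versus old, in percent, when smaller is better.
    static double percentLess(double oldValue, double newValue) {
        return (oldValue - newValue) / oldValue * 100.0;
    }

    public static void main(String[] args) {
        // Raw timings for one old/new pair: 20000 docs in 374.7 vs 236.1 secs.
        double oldRate = docsPerSec(20000, 374.7);
        double newRate = docsPerSec(20000, 236.1);
        System.out.printf(Locale.ROOT, "old %.1f; new %.1f [%.1f%% faster]%n",
                oldRate, newRate, percentMore(oldRate, newRate));
    }
}
```

Because the reported times and rates are already rounded, recomputing
a percentage from the printed values can differ from the listed figure
in the last digit.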

Here are the results for 10K tokens (= ~55,000 bytes of plain text)
per document:

  20000 DOCS @ ~55,000 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


    No term vectors or stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          20000 docs in 200.3 secs
          index size = 358M

        new
          20000 docs in 126.0 secs
          index size = 356M

        Total Docs/sec:             old    99.8; new   158.7 [   59.0% faster]
        Docs/MB @ flush:            old    24.2; new    49.1 [  102.5% more]
        Avg RAM used (MB) @ flush:  old    74.5; new    36.2 [   51.4% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          20000 docs in 202.7 secs
          index size = 358M

        new
          20000 docs in 120.0 secs
          index size = 354M

        Total Docs/sec:             old    98.7; new   166.7 [   69.0% faster]
        Docs/MB @ flush:            old    24.2; new    48.9 [  101.7% more]
        Avg RAM used (MB) @ flush:  old    74.3; new    37.0 [   50.2% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          20000 docs in 374.7 secs
          index size = 1.4G

        new
          20000 docs in 236.1 secs
          index size = 1.4G

        Total Docs/sec:             old    53.4; new    84.7 [   58.7% faster]
        Docs/MB @ flush:            old    10.2; new    49.1 [  382.8% more]
        Avg RAM used (MB) @ flush:  old   129.3; new    36.6 [   71.7% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          20000 docs in 385.7 secs
          index size = 1.4G

        new
          20000 docs in 182.8 secs
          index size = 1.4G

        Total Docs/sec:             old    51.9; new   109.4 [  111.0% faster]
        Docs/MB @ flush:            old    10.2; new    48.9 [  380.9% more]
        Avg RAM used (MB) @ flush:  old    76.0; new    37.3 [   50.9% less]



> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, e.g. how segments
> are merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
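
As a rough illustration of flushing by RAM usage rather than by a
fixed document count, here is a minimal, self-contained sketch.  It is
not the patch's MultiDocumentWriter and not IndexWriter's API: the
class name, the methods, and the byte-length RAM estimate are all
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's IndexWriter or the patch.
final class RamBoundedBuffer {
    private final long ramBudgetBytes;
    private final List<byte[]> buffered = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    RamBoundedBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    // Buffer one document; flush once the estimated RAM in use reaches the budget.
    void addDocument(byte[] doc) {
        buffered.add(doc);
        bytesUsed += doc.length; // crude estimate: payload bytes only
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    // A real writer would write a segment here; the sketch only resets its state.
    void flush() {
        if (buffered.isEmpty()) {
            return;
        }
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    int bufferedDocs() {
        return buffered.size();
    }
}
```

With a policy like this, large documents trigger a flush after only a
few adds while small documents accumulate in much larger batches, so
the buffer's memory use stays near the budget regardless of document
size.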

