[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492668 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

Results with the above patch:

  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10

2000000 DOCS @ ~550 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      old: 2000000 docs in 782.8 secs; index size = 436M
      new: 2000000 docs in  93.4 secs; index size = 430M
      Total Docs/sec:            old 2554.8; new 21421.1 [ 738.5% faster]
      Docs/MB @ flush:           old  128.0; new  4058.0 [3069.6% more]
      Avg RAM used (MB) @ flush: old  140.2; new    38.0 [  72.9% less]

    AUTOCOMMIT = false (commit only once at the end)
      old: 2000000 docs in 780.2 secs; index size = 436M
      new: 2000000 docs in  90.6 secs; index size = 427M
      Total Docs/sec:            old 2563.3; new 22086.8 [ 761.7% faster]
      Docs/MB @ flush:           old  128.0; new  4118.4 [3116.7% more]
      Avg RAM used (MB) @ flush: old  144.6; new    36.4 [  74.8% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      old: 2000000 docs in 1227.6 secs; index size = 2.1G
      new: 2000000 docs in  559.8 secs; index size = 2.1G
      Total Docs/sec:            old 1629.2; new 3572.5 [ 119.3% faster]
      Docs/MB @ flush:           old   93.1; new 4058.0 [4259.1% more]
      Avg RAM used (MB) @ flush: old  193.4; new   38.5 [  80.1% less]

    AUTOCOMMIT = false (commit only once at the end)
      old: 2000000 docs in 1229.2 secs; index size = 2.1G
      new: 2000000 docs in  241.0 secs; index size = 2.1G
      Total Docs/sec:            old 1627.0; new 8300.0 [ 410.1% faster]
      Docs/MB @ flush:           old   93.1; new 4118.4 [4323.9% more]
      Avg RAM used (MB) @ flush: old  150.5; new   38.4 [  74.5% less]

200000 DOCS @ ~5,500 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      old: 200000 docs in 352.2 secs; index size = 406M
      new: 200000 docs in  86.4 secs; index size = 403M
      Total Docs/sec:            old 567.9; new 2313.7 [ 307.4% faster]
      Docs/MB @ flush:           old  83.5; new  420.0 [ 402.7% more]
      Avg RAM used (MB) @ flush: old  76.8; new   38.1 [  50.4% less]

    AUTOCOMMIT = false (commit only once at the end)
      old: 200000 docs in 399.2 secs; index size = 406M
      new: 200000 docs in  89.6 secs; index size = 400M
      Total Docs/sec:            old 501.0; new 2231.0 [ 345.3% faster]
      Docs/MB @ flush:           old  83.5; new  422.6 [ 405.8% more]
      Avg RAM used (MB) @ flush: old  76.7; new   36.2 [  52.7% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      old: 200000 docs in 594.2 secs; index size = 1.7G
      new: 200000 docs in 229.0 secs; index size = 1.7G
      Total Docs/sec:            old 336.6; new  873.3 [ 159.5% faster]
      Docs/MB @ flush:           old  47.9; new  420.0 [ 776.9% more]
      Avg RAM used (MB) @ flush: old 157.8; new   38.0 [  75.9% less]

    AUTOCOMMIT = false (commit only once at the end)
      old: 200000 docs in 605.1 secs; index size = 1.7G
      new: 200000 docs in 181.3 secs; index size = 1.7G
      Total Docs/sec:            old 330.5; new 1103.1 [ 233.7% faster]
      Docs/MB @ flush:           old  47.9; new  422.6 [ 782.2% more]
      Avg RAM used (MB) @ flush: old 132.0; new   37.1 [  71.9% less]

20000 DOCS @ ~55,000 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      old: 20000 docs in 180.8 secs; index size = 350M
      new: 20000 docs in  79.1 secs; index size = 349M
      Total Docs/sec:            old 110.6; new 252.8 [ 128.5% faster]
      Docs/MB @ flush:           old  25.0; new  46.8 [  87.0% more]
      Avg RAM used (MB) @ flush: old 112.2; new  44.3 [  60.5% less]

    AUTOCOMMIT = false (commit only once at the end)
      old: 20000 docs in 180.1 secs; index size = 350M
      new: 20000 docs in  75.9 secs; index size = 347M
      Total Docs/sec:            old 111.0; new 263.5 [ 137.3% faster]
      Docs/MB @ flush:           old  25.0; new  47.5 [  89.7% more]
      Avg RAM used (MB) @ flush: old 111.1; new  42.5 [  61.7% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)
      old: 20000 docs in 323.1 secs; index size = 1.4G
      new: 20000 docs in 183.9 secs; index size = 1.4G
      Total Docs/sec:            old 61.9; new 108.7 [  75.7% faster]
      Docs/MB @ flush:           old 10.4; new  46.8 [ 348.3% more]
      Avg RAM used (MB) @ flush: old 74.2; new  44.9 [  39.5% less]

    AUTOCOMMIT = false (commit only once at the end)
      old: 20000 docs in 323.5 secs; index size = 1.4G
      new: 20000 docs in 135.6 secs; index size = 1.4G
      Total Docs/sec:            old 61.8; new 147.5 [ 138.5% faster]
      Docs/MB @ flush:           old 10.4; new  47.5 [ 354.8% more]
      Avg RAM used (MB) @ flush: old 74.3; new  42.9 [  42.2% less]

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than one document directly into a single Lucene segment, more efficiently than the current approach.
> This only affects the creation of an initial segment from added documents. I haven't changed anything after that, e.g. how segments are merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do in-RAM merges. Once RAM is full, flush buffers to disk (and merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment. But I haven't made any changes to Lucene's file format nor added requirements for a global fields schema.
> So far the only externally visible change is a new method "setRAMBufferSize" in IndexWriter (setMaxBufferedDocs is deprecated), so that it flushes according to RAM usage rather than a fixed number of documents added.
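For illustration, here is a minimal sketch of how the API change described above might look to an application, based only on the patch description: the method name setRAMBufferSize (replacing the deprecated setMaxBufferedDocs) is taken from the comment, but its exact signature and unit are assumptions here (the sketch treats the argument as a size in megabytes); the surrounding IndexWriter and Field calls are the stock Lucene 2.x API.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class RamBufferSketch {
      public static void main(String[] args) throws Exception {
          // Create a new index using the standard Lucene 2.x constructor.
          IndexWriter writer = new IndexWriter("/tmp/testindex",
                                               new StandardAnalyzer(), true);

          // Proposed call from the patch description: flush buffered postings
          // once they use roughly this much RAM (32 MB here, matching the
          // benchmark setup above) instead of after a fixed document count.
          // The unit (MB) is an assumption for this sketch.
          writer.setRAMBufferSize(32);

          // The older, now-deprecated way of controlling flushes:
          // writer.setMaxBufferedDocs(10000);

          Document doc = new Document();
          doc.add(new Field("body", "some plain text ...",
                            Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(doc);

          writer.close();
      }
  }

Flushing by RAM usage rather than document count keeps memory bounded regardless of document size, which is why the benchmarks above cover ~550-byte, ~5,500-byte, and ~55,000-byte documents under the same 32 MB buffer.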