[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502793
 ] 

Michael McCandless commented on LUCENE-843:
-------------------------------------------

I ran a benchmark that indexes with more than 1 thread, in order to
test & compare the concurrency of trunk and the patch.  The test is
the same as above, and runs on a 4-core Mac Pro (OS X) box with a
4-drive RAID 0 IO system.
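
For readers who want to try a similar multi-threaded run, here is a minimal sketch of a driver. It is not the benchmark code used for these numbers. The names IndexingBench, DocSink and run are hypothetical, and DocSink stands in for whatever adds one document to the shared index writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class IndexingBench {

    /** Hypothetical stand-in for "add one document to the shared index". */
    interface DocSink {
        void addDocument(int docId) throws Exception;
    }

    /** Runs numThreads workers against one shared sink; returns the number of docs added. */
    static int run(DocSink sink, int numThreads, int targetDocs)
            throws InterruptedException, ExecutionException {
        AtomicInteger nextId = new AtomicInteger();
        AtomicInteger added = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<Void>> workers = new ArrayList<>();
        try {
            for (int t = 0; t < numThreads; t++) {
                Callable<Void> worker = () -> {
                    int id;
                    // Each worker claims the next unclaimed doc id until the target is reached.
                    while ((id = nextId.getAndIncrement()) < targetDocs) {
                        sink.addDocument(id);
                        added.incrementAndGet();
                    }
                    return null;
                };
                workers.add(pool.submit(worker));
            }
            for (Future<Void> w : workers) {
                w.get(); // wait for the worker; rethrows its failure, if any
            }
        } finally {
            pool.shutdown();
        }
        return added.get();
    }

    public static void main(String[] args) throws Exception {
        int targetDocs = 200_000;
        for (int threads = 1; threads <= 4; threads++) {
            AtomicInteger sinkCalls = new AtomicInteger();
            long start = System.nanoTime();
            // A no-op sink: replace with real indexing work to get meaningful timings.
            int added = run(id -> sinkCalls.incrementAndGet(), threads, targetDocs);
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("NUM THREADS = %d: %d docs in %.3f secs%n", threads, added, secs);
        }
    }
}
```

With the no-op sink the timings only measure thread overhead; the point is the shape of the driver, where all workers share one sink and pull doc ids from one counter.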

Here are the raw results:

DOCS = ~5,500 bytes plain text
RAM = 32 MB
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

NUM THREADS = 1

        new
          200000 docs in 172.3 secs
          index size = 1.7G

        old
          200000 docs in 539.5 secs
          index size = 1.7G

        Total Docs/sec:             old   370.7; new  1161.0 [  213.2% faster]
        Docs/MB @ flush:            old    47.9; new   334.6 [  598.7% more]
        Avg RAM used (MB) @ flush:  old   131.9; new    33.1 [   74.9% less]


NUM THREADS = 2

        new
          200001 docs in 130.8 secs
          index size = 1.7G

        old
          200001 docs in 452.8 secs
          index size = 1.7G

        Total Docs/sec:             old   441.7; new  1529.3 [  246.2% faster]
        Docs/MB @ flush:            old    47.9; new   301.5 [  529.7% more]
        Avg RAM used (MB) @ flush:  old   226.1; new    35.2 [   84.4% less]


NUM THREADS = 3

        new
          200002 docs in 105.4 secs
          index size = 1.7G

        old
          200002 docs in 428.4 secs
          index size = 1.7G

        Total Docs/sec:             old   466.8; new  1897.9 [  306.6% faster]
        Docs/MB @ flush:            old    47.9; new   277.8 [  480.2% more]
        Avg RAM used (MB) @ flush:  old   289.8; new    37.0 [   87.2% less]


NUM THREADS = 4

        new
          200003 docs in 104.8 secs
          index size = 1.7G

        old
          200003 docs in 440.4 secs
          index size = 1.7G

        Total Docs/sec:             old   454.1; new  1908.5 [  320.3% faster]
        Docs/MB @ flush:            old    47.9; new   259.9 [  442.9% more]
        Avg RAM used (MB) @ flush:  old   293.7; new    37.1 [   87.3% less]


NUM THREADS = 5

        new
          200004 docs in 99.5 secs
          index size = 1.7G

        old
          200004 docs in 425.0 secs
          index size = 1.7G

        Total Docs/sec:             old   470.6; new  2010.5 [  327.2% faster]
        Docs/MB @ flush:            old    47.9; new   245.3 [  412.6% more]
        Avg RAM used (MB) @ flush:  old   390.9; new    38.3 [   90.2% less]


NUM THREADS = 6

        new
          200005 docs in 106.3 secs
          index size = 1.7G

        old
          200005 docs in 427.1 secs
          index size = 1.7G

        Total Docs/sec:             old   468.2; new  1882.3 [  302.0% faster]
        Docs/MB @ flush:            old    47.8; new   248.5 [  419.3% more]
        Avg RAM used (MB) @ flush:  old   340.9; new    38.7 [   88.6% less]


NUM THREADS = 7

        new
          200006 docs in 106.1 secs
          index size = 1.7G

        old
          200006 docs in 435.2 secs
          index size = 1.7G

        Total Docs/sec:             old   459.6; new  1885.3 [  310.2% faster]
        Docs/MB @ flush:            old    47.8; new   248.7 [  420.0% more]
        Avg RAM used (MB) @ flush:  old   408.6; new    39.1 [   90.4% less]


NUM THREADS = 8

        new
          200007 docs in 109.0 secs
          index size = 1.7G

        old
          200007 docs in 469.2 secs
          index size = 1.7G

        Total Docs/sec:             old   426.3; new  1835.2 [  330.5% faster]
        Docs/MB @ flush:            old    47.8; new   251.3 [  425.5% more]
        Avg RAM used (MB) @ flush:  old   448.9; new    39.0 [   91.3% less]
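
The bracketed percentages in the summary lines follow from the old and new values next to them. A small sketch of that arithmetic, assuming "% faster" and "% more" mean (new / old - 1) * 100 and "% less" means (1 - new / old) * 100 (the class and method names are made up for illustration):

```java
final class BenchStats {

    /** Throughput in documents per second. */
    static double docsPerSec(int docs, double secs) {
        return docs / secs;
    }

    /** How much larger newValue is than oldValue, in percent ("% faster", "% more"). */
    static double percentMore(double oldValue, double newValue) {
        return (newValue / oldValue - 1.0) * 100.0;
    }

    /** How much smaller newValue is than oldValue, in percent ("% less"). */
    static double percentLess(double oldValue, double newValue) {
        return (1.0 - newValue / oldValue) * 100.0;
    }

    public static void main(String[] args) {
        // NUM THREADS = 1 figures from the results above.
        double oldRate = docsPerSec(200000, 539.5);
        double newRate = docsPerSec(200000, 172.3);
        System.out.printf("Total Docs/sec: old %.1f; new %.1f [%.1f%% faster]%n",
                oldRate, newRate, percentMore(oldRate, newRate));
        System.out.printf("Avg RAM used (MB) @ flush: old %.1f; new %.1f [%.1f%% less]%n",
                131.9, 33.1, percentLess(131.9, 33.1));
    }
}
```

Because the elapsed times above are already rounded to one decimal, values recomputed from them can differ from the reported ones in the last digit.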



Some quick comments:

  * Both trunk & the patch show speedups if you use more than 1 thread
    to do indexing.  This is expected since the machine has 4 cores.

  * The biggest speedup is from 1->2 threads, but there are still good
    gains from 2->5 threads.

  * Best seems to be 5 threads (470.6 docs/sec on trunk, 2010.5 with
    the patch).

  * The patch allows better concurrency: relatively speaking, it speeds
    up faster than the trunk as you increase # threads (the % faster
    goes from 213.2% at 1 thread to 330.5% at 8 threads, though not
    monotonically).  I think this makes sense because we flush less
    often with the patch, and flushing is time consuming and single
    threaded.
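
One way to check the concurrency comparison is to normalize each series by its own 1-thread throughput. The sketch below does that with the Total Docs/sec figures from the results above; the class and method names are illustrative only. At 5 threads this gives roughly 1.7x for the patch versus roughly 1.3x for trunk.

```java
final class ThreadScaling {

    // Total Docs/sec for 1..8 threads, copied from the results above.
    static final double[] OLD = {370.7, 441.7, 466.8, 454.1, 470.6, 468.2, 459.6, 426.3};
    static final double[] NEW = {1161.0, 1529.3, 1897.9, 1908.5, 2010.5, 1882.3, 1885.3, 1835.2};

    /** Throughput at the given thread count relative to the same series at 1 thread. */
    static double speedup(double[] rates, int threads) {
        return rates[threads - 1] / rates[0];
    }

    /** Thread count (1-based) with the highest throughput in the series. */
    static int bestThreads(double[] rates) {
        int best = 0;
        for (int i = 1; i < rates.length; i++) {
            if (rates[i] > rates[best]) {
                best = i;
            }
        }
        return best + 1;
    }

    public static void main(String[] args) {
        System.out.println("threads  old-speedup  new-speedup");
        for (int t = 1; t <= OLD.length; t++) {
            System.out.printf("%7d  %11.2f  %11.2f%n", t, speedup(OLD, t), speedup(NEW, t));
        }
        System.out.printf("best: old = %d threads, new = %d threads%n",
                bestThreads(OLD), bestThreads(NEW));
    }
}
```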


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, 
> LUCENE-843.take6.patch, LUCENE-843.take7.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
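
To illustrate the difference between flushing by RAM usage and flushing by document count, here is a toy sketch. It is not the patch's implementation: RamBudgetBuffer is a hypothetical class, and it uses raw document bytes as a crude proxy for buffered RAM, which the real writer does not do.

```java
final class FlushPolicyDemo {

    /** Hypothetical buffer that flushes once buffered bytes reach a RAM budget. */
    static final class RamBudgetBuffer {
        private final long budgetBytes;
        private long usedBytes;
        private int bufferedDocs;
        private int flushCount;

        RamBudgetBuffer(long budgetBytes) {
            if (budgetBytes <= 0) {
                throw new IllegalArgumentException("budget must be positive");
            }
            this.budgetBytes = budgetBytes;
        }

        /** Buffers one document; flushes when the budget is reached, however many docs that took. */
        void add(int docBytes) {
            usedBytes += docBytes;
            bufferedDocs++;
            if (usedBytes >= budgetBytes) {
                flush();
            }
        }

        /** Pretends to write the buffered docs out as a segment and resets the counters. */
        void flush() {
            if (bufferedDocs == 0) {
                return;
            }
            flushCount++;
            usedBytes = 0;
            bufferedDocs = 0;
        }

        int flushCount() { return flushCount; }
        int bufferedDocs() { return bufferedDocs; }
        long usedBytes() { return usedBytes; }
    }

    public static void main(String[] args) {
        // Assumed figures for illustration: a 32 MB budget and ~5,500-byte documents.
        RamBudgetBuffer buffer = new RamBudgetBuffer(32L * 1024 * 1024);
        for (int i = 0; i < 200_000; i++) {
            buffer.add(5_500);
        }
        buffer.flush(); // final flush, analogous to committing once at the end
        System.out.println("flushes: " + buffer.flushCount());
    }
}
```

With a count-based trigger, the RAM held at flush time depends on how large the documents happen to be; with a budget-based trigger, the RAM held is bounded and the number of docs per flush varies instead.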
