[ 
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507587
 ] 

Michael McCandless commented on LUCENE-843:
-------------------------------------------


> I thought it would be interesting to see how the new code performs in this 
> scenario, what do you think?

Yes, I'd be very interested to see the results of this.  It's a
somewhat "unusual" indexing situation (such tiny docs), but it's a
real-world test case.  Thanks!

>  - what settings do you recommend?

I think these are likely the important ones in this case:

  * Flush by RAM instead of doc count
    (writer.setRAMBufferSizeMB(...)).

  * Give it as much RAM as you can.

  * Use maybe 3 indexing threads (if you can).

  * Turn off compound file.

  * If you have stored fields/vectors (seems not in this case) use
    autoCommit=false.

  * Use a trivial analyzer that doesn't allocate a new String or a new
    Token per term (re-use the same Token, and use the char[] based
    term text storage instead of the String one).

  * Re-use Document/Field instances.  The DocumentsWriter is fine with
    this and it saves substantial time from GC especially because your
    docs are so tiny (per-doc overhead is otherwise a killer).  In
    IndexLineFiles I made a StringReader that lets me reset its String
    value; this way I didn't have to change the Field instances stored
    in the Document.
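
To illustrate the last point, here is a minimal sketch of a resettable
reader.  It is not the actual code from IndexLineFiles; the class name
ReusableStringReader and the method setValue are made up for this
example.  The idea is only that one Reader instance can be pointed at
a new String for each document, so a Field built around it never has
to be replaced.

```java
import java.io.Reader;

// Hypothetical sketch: a Reader whose backing String can be swapped,
// so a single instance can be re-used for every document.
final class ReusableStringReader extends Reader {
    private String value = "";
    private int pos = 0;

    // Point this reader at a new String and rewind to its start.
    void setValue(String newValue) {
        this.value = newValue;
        this.pos = 0;
    }

    @Override
    public int read(char[] cbuf, int off, int len) {
        if (len == 0) {
            return 0;
        }
        int left = value.length() - pos;
        if (left <= 0) {
            return -1; // end of the current String
        }
        int n = Math.min(len, left);
        value.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() {
        // Nothing to release; the reader stays usable after close().
    }
}
```

In an indexing loop one would call setValue(line) on the same reader
before each addDocument call, instead of allocating a new StringReader
and a new Field per line.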

>  - is there any chance for speed-up in optimize()?  I didn't read
>    your new code yet, but at least from some comments here it seems
>    that on disk merging was not changed... is this (still) so? I would

Correct: my patch doesn't touch merging and optimizing.  All it does
now is gather many docs in RAM and then flush a new segment when it's
time.  I've opened a separate issue (LUCENE-856) for optimizations
in segment merging.

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip, 
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, 
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, 
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.


