[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless (JIRA) Fri, 08 Jun 2007 06:32:51 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take7.patch

Latest working patch attached.

I've cut over to using Lucene's normal segment merging for all merging
(ie, I no longer use a different merge-efficient format for segments
when autoCommit=false); this has substantially simplified the code.

All unit tests pass except the disk-full test and certain contrib
tests (gdata-server, lucli, similarity, wordnet) whose failures I
don't think this patch causes.

Other changes:

  * Consolidated flushing of a new segment back into IndexWriter
    (previously DocumentsWriter would do its own flushing when
    autoCommit=false).

    I would also like to consolidate merging entirely into
    IndexWriter; right now DocumentsWriter does its own merging of the
    flushed segments when autoCommit=false (this is because those
    segments are "partial" meaning they do not have their own stored
    fields or term vectors).  I'm trying to find a clean way to do
    this...

  * Thread concurrency now works: each thread writes into a separate
    Postings hash (up to a limit, currently 5, beyond which threads
    share the Postings hashes), and when flushing the segment I merge
    the docIDs together.  I flush when the total RAM used across
    threads is over the limit.  I ran a test comparing thread
    concurrency on current trunk vs this patch, which I'll post next.

  * Reduced the bytes used per unique term to below what current
    Lucene uses.  This means the worst-case document (many terms, all
    of which are unique) should use less RAM overall than Lucene trunk
    does.

  * Added some new unit test cases; added missing "writer.close()" to
    one of the contrib tests.

  * Cleanup, comments, etc.  I think the code is getting more
    "approachable" now.

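For readers following along, the thread-concurrency scheme in the
list above might look roughly like the Java sketch below.  It is not
taken from the patch: the class name, the method names and the crude
byte accounting are all assumptions made for illustration, and one
coarse lock stands in for whatever finer-grained locking the real
code uses.  Only the shape comes from the comment: one postings hash
per thread up to a cap of 5, docIDs merged at flush, and a flush
triggered by total RAM across threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; every name here is hypothetical, not from the patch.
class PerThreadPostingsSketch {
    static final int MAX_HASHES = 5;   // the "currently 5" limit

    // Each hash maps term -> docIDs; threads are assigned to a hash by slot.
    private final List<Map<String, List<Integer>>> hashes = new ArrayList<>();
    private final Map<Thread, Integer> slotForThread = new HashMap<>();
    private final List<Map<String, List<Integer>>> segments = new ArrayList<>();
    private final long ramLimitBytes;
    private long ramUsedBytes;         // crude estimate, summed over all hashes
    private int nextDocID;

    PerThreadPostingsSketch(long ramLimitBytes) {
        this.ramLimitBytes = ramLimitBytes;
    }

    // One coarse lock keeps the sketch simple and correct; it does not
    // let threads do their indexing work in parallel.
    synchronized int addDocument(String... terms) {
        Thread me = Thread.currentThread();
        Integer slot = slotForThread.get(me);
        if (slot == null) {
            // New threads get their own hash until the cap, then share.
            slot = slotForThread.size() % MAX_HASHES;
            if (slot == hashes.size()) hashes.add(new HashMap<>());
            slotForThread.put(me, slot);
        }
        int docID = nextDocID++;
        Map<String, List<Integer>> hash = hashes.get(slot);
        for (String term : terms) {
            List<Integer> docs = hash.get(term);
            if (docs == null) {
                docs = new ArrayList<>();
                hash.put(term, docs);
                ramUsedBytes += 2L * term.length();   // assumed per-term cost
            }
            // Record each docID once per term, even if the term repeats.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docID) {
                docs.add(docID);
                ramUsedBytes += 4;                    // assumed per-posting cost
            }
        }
        if (ramUsedBytes > ramLimitBytes) flush();    // total across threads
        return docID;
    }

    // Merge all hashes into one segment with ascending docIDs per term.
    synchronized Map<String, List<Integer>> flush() {
        Map<String, List<Integer>> segment = new TreeMap<>();
        for (Map<String, List<Integer>> hash : hashes) {
            for (Map.Entry<String, List<Integer>> e : hash.entrySet()) {
                segment.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .addAll(e.getValue());
            }
            hash.clear();
        }
        for (List<Integer> docs : segment.values()) Collections.sort(docs);
        ramUsedBytes = 0;
        segments.add(segment);
        return segment;
    }

    synchronized int segmentCount() {
        return segments.size();
    }
}
```

Sorting the concatenated docID lists is the simplest way to show the
merge step; since each hash already holds its docIDs in ascending
order, a k-way merge would do the same job with less work.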

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, 
> LUCENE-843.take6.patch, LUCENE-843.take7.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

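To make the last point of the description concrete, here is a small
Java sketch comparing the two flush triggers.  It does not use
Lucene's API; the class and method names are hypothetical, and
document sizes are simply given as byte counts.  It reports the peak
number of bytes buffered under each policy.

```java
// Illustrative sketch only; names are hypothetical, not Lucene's API.
class FlushPolicySketch {
    // Peak bytes buffered when flushing after every maxBufferedDocs documents.
    static long peakWhenFlushingByCount(long[] docBytes, int maxBufferedDocs) {
        long buffered = 0, peak = 0;
        int docs = 0;
        for (long size : docBytes) {
            buffered += size;
            docs++;
            peak = Math.max(peak, buffered);
            if (docs >= maxBufferedDocs) {   // flush on document count
                buffered = 0;
                docs = 0;
            }
        }
        return peak;
    }

    // Peak bytes buffered when flushing once the buffer exceeds ramLimitBytes.
    static long peakWhenFlushingByRam(long[] docBytes, long ramLimitBytes) {
        long buffered = 0, peak = 0;
        for (long size : docBytes) {
            buffered += size;
            peak = Math.max(peak, buffered);
            if (buffered > ramLimitBytes) {  // flush on RAM usage
                buffered = 0;
            }
        }
        return peak;
    }
}
```

With a count-based trigger the peak grows with document size, while
with a RAM-based trigger it stays within the limit plus the size of
one document, whatever the mix of small and large documents.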