[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506974 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

> Do you think your code is easily extensible in this regard? I'm
> wondering because of all the optimizations you're doing like e.g.
> sharing byte arrays. But I'm certainly not familiar enough with your
> code yet, so I'm only guessing here.

Good question! DocumentsWriter is definitely more complex than DocumentWriter, but it doesn't prevent extensibility, and I think it will work very well when we do flexible indexing.

The patch now has dedicated methods for writing into the freq/prox/etc. streams ('writeFreqByte', 'writeFreqVInt', 'writeProxByte', 'writeProxVInt', etc.), but this could easily be changed to instead use true IndexOutput streams. That would hide all details of shared byte arrays from whoever is doing the writing.

The way I roughly see flexible indexing working in the future is that DocumentsWriter will be responsible for keeping track of unique terms seen (in its hash table), holding the Posting instance (which could be subclassed in the future) for each term, flushing a real segment when full, handling shared byte arrays, etc. -- ie, all the "infrastructure". The specific logic of what bytes are written into which streams (freq/prox/vectors/others) would then be handled by a separate class or classes that we can plug/unplug according to some "schema". DocumentsWriter would call on these classes, providing the IndexOutputs for all of the Posting's streams per position, and these classes would write their own format into those IndexOutputs.

I think a separation like that would work well: we could have good performance and also extensibility. The devil is in the details, of course... I obviously haven't factored DocumentsWriter this way (it has its own addPosition that writes the current Lucene index format), but I think this is very doable in the future.
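To make the separation concrete, here is a minimal, self-contained sketch of the pluggable design described above. The names `PostingsConsumer`, `SegmentOutput`, and `DeltaProxConsumer` are illustrative inventions, not Lucene APIs; `SegmentOutput` stands in for IndexOutput and implements Lucene-style VInt encoding (7 bits per byte, low-order first, high bit set on all but the last byte):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class PluggablePostings {

    // Minimal stand-in for Lucene's IndexOutput: an append-only byte
    // stream supporting VInt encoding.
    static class SegmentOutput {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        void writeVInt(int i) {
            // Emit 7 bits at a time, low-order first; high bit marks
            // "more bytes follow".
            while ((i & ~0x7F) != 0) {
                bytes.write((i & 0x7F) | 0x80);
                i >>>= 7;
            }
            bytes.write(i);
        }
    }

    // The pluggable piece: given the streams for a term, write one
    // position. DocumentsWriter would own the streams and buffers;
    // implementations of this interface would own the on-disk format.
    interface PostingsConsumer {
        void addPosition(SegmentOutput freq, SegmentOutput prox,
                         int position, int lastPosition) throws IOException;
    }

    // One possible implementation: delta-encode positions into the prox
    // stream as VInts, roughly in the spirit of Lucene's current format.
    static class DeltaProxConsumer implements PostingsConsumer {
        public void addPosition(SegmentOutput freq, SegmentOutput prox,
                                int position, int lastPosition) {
            prox.writeVInt(position - lastPosition);
        }
    }

    public static void main(String[] args) throws IOException {
        SegmentOutput freq = new SegmentOutput();
        SegmentOutput prox = new SegmentOutput();
        PostingsConsumer consumer = new DeltaProxConsumer();
        consumer.addPosition(freq, prox, 5, 0);     // delta 5 -> one byte
        consumer.addPosition(freq, prox, 305, 5);   // delta 300 -> two bytes
        byte[] out = prox.bytes.toByteArray();
        System.out.println(out.length);             // 3
        System.out.println(out[0] & 0xFF);          // 5
        System.out.println(out[1] & 0xFF);          // 172
        System.out.println(out[2] & 0xFF);          // 2
    }
}
```

The point of the sketch is only the shape of the boundary: the infrastructure hands streams to the format class, and the format class never sees the shared byte arrays behind them.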
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip,
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch,
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
>
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
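The externally visible change above -- flushing by RAM consumption rather than by a fixed document count -- can be sketched with a toy model. This is not the patch's actual accounting: the class name, the `ramThreshold` parameter (standing in for setRAMBufferSize), and the crude 2-bytes-per-char estimate are all made up for illustration, whereas the real patch tracks shared byte arrays, Posting objects, etc.:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of flushing by estimated RAM usage instead of doc count.
public class RamFlushBuffer {
    private final long ramThreshold;   // analogous to setRAMBufferSize
    private final List<String> buffered = new ArrayList<>();
    private long ramUsed = 0;
    private int flushCount = 0;

    RamFlushBuffer(long ramThreshold) {
        this.ramThreshold = ramThreshold;
    }

    void addDocument(String doc) {
        buffered.add(doc);
        ramUsed += 2L * doc.length();  // crude per-doc RAM estimate
        if (ramUsed >= ramThreshold) {
            flush();
        }
    }

    private void flush() {
        // A real flush would write the buffered postings to disk here
        // (and merge those runs later into a real segment).
        buffered.clear();
        ramUsed = 0;
        flushCount++;
    }

    int flushCount()   { return flushCount; }
    int bufferedDocs() { return buffered.size(); }

    public static void main(String[] args) {
        RamFlushBuffer buf = new RamFlushBuffer(100);
        for (int i = 0; i < 10; i++) {
            buf.addDocument("0123456789");   // ~20 bytes each
        }
        System.out.println(buf.flushCount());   // 2
        System.out.println(buf.bufferedDocs()); // 0
    }
}
```

The benefit over setMaxBufferedDocs is that small and large documents no longer trigger flushes at the same fixed count: the buffer adapts to actual memory pressure.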