[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486293 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

To do the benchmarking I created a simple standalone tool
(demo/IndexLineFiles, in the last patch) that indexes one line at a
time from a large, previously created file, optionally using multiple
threads.  I do it this way to minimize the IO cost of pulling the
document source, since I want to measure just indexing time as purely
as possible.

Each line is read and a doc is created with a field "contents" that is
not stored, is tokenized, and optionally has term vectors with
positions+offsets.  I also optionally add two small stored-only fields
("path" and "modified").  I think these are fairly trivial documents
compared to typical usage of Lucene.
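The per-line loop described above can be sketched roughly as follows; this is a stdlib-only stand-in (the class and method names here, like LineIndexSketch and indexLine, are hypothetical -- the real tool is demo/IndexLineFiles in the patch, and indexLine would build a Lucene Document and hand it to the IndexWriter).  It shows the one-doc-per-line loop and the optional multi-threaded variant via a shared queue:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class LineIndexSketch {

    // Sentinel object used to tell worker threads to stop.
    private static final String POISON = new String("\u0000END\u0000");

    // Hypothetical stand-in for building a Document from one line and
    // adding it to the IndexWriter; here we only count docs.
    static void indexLine(String line, AtomicInteger docCount) {
        docCount.incrementAndGet();
    }

    // Reads lines from 'in' and indexes each one on numThreads workers.
    static int indexAllLines(BufferedReader in, int numThreads) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        AtomicInteger docCount = new AtomicInteger();
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String line;
                    // Reference comparison is intentional: POISON is a sentinel.
                    while ((line = queue.take()) != POISON) {
                        indexLine(line, docCount);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        String line;
        while ((line = in.readLine()) != null) {
            queue.put(line);               // one doc per line
        }
        for (int i = 0; i < numThreads; i++) {
            queue.put(POISON);             // one poison pill per worker
        }
        for (Thread t : workers) {
            t.join();
        }
        return docCount.get();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) sb.append("token token token\n");
        int docs = indexAllLines(new BufferedReader(new StringReader(sb.toString())), 4);
        System.out.println(docs);  // prints 100
    }
}
```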

For the corpus, I took Europarl's "en" content, stripped the tags, and
processed it into 3 plain-text files: one with 100 tokens per line
(= ~550 bytes per line), one with 1,000 tokens per line (= ~5,500
bytes), and one with 10,000 tokens per line (= ~55,000 bytes).
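The regrouping step above (turning a token stream into fixed-size lines) could look something like this; the actual preprocessing script isn't part of the patch, so this is just a sketch of the idea:

```java
import java.util.ArrayList;
import java.util.List;

public class CorpusLines {

    // Regroups a whitespace-separated token stream into lines of exactly
    // tokensPerLine tokens each; a trailing partial line is dropped so
    // that every line has a uniform token count.
    static List<String> toLines(String text, int tokensPerLine) {
        List<String> lines = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        int n = 0;
        for (String tok : text.split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (n > 0) sb.append(' ');
            sb.append(tok);
            if (++n == tokensPerLine) {
                lines.add(sb.toString());
                sb.setLength(0);
                n = 0;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(toLines("a b c d e", 2));  // prints [a b, c d]
    }
}
```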

All settings (mergeFactor, compound file, etc.) are left at defaults.
I don't optimize the index at the end.  I'm using my new
SimpleSpaceAnalyzer (it just splits tokens on the space character and
creates each token's text as a slice into a char[] array instead of a
new String(...)) to minimize the cost of tokenization.
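The slice-into-char[] idea can be illustrated like this (the class name SpaceTokenizerSketch is made up; the real SimpleSpaceAnalyzer in the patch plugs into Lucene's Analyzer/TokenStream API, which this sketch does not model).  The point is that each token is just a (start, length) pair into the shared buffer, so no per-token String is allocated:

```java
import java.util.ArrayList;
import java.util.List;

public class SpaceTokenizerSketch {

    // Emits (start, length) pairs pointing into buf; token text is never
    // copied into a new String, mirroring the slice-into-char[] idea.
    static List<int[]> tokenize(char[] buf, int len) {
        List<int[]> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= len; i++) {
            if (i == len || buf[i] == ' ') {
                if (i > start) {                       // skip empty tokens
                    tokens.add(new int[] { start, i - start });
                }
                start = i + 1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] buf = "hello world".toCharArray();
        for (int[] t : tokens(buf)) {
            // Render a token only when we actually need its text.
            System.out.println(new String(buf, t[0], t[1]));
        }
    }

    private static List<int[]> tokens(char[] buf) {
        return tokenize(buf, buf.length);
    }
}
```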

I ran the tests with Java 1.5 on a quad-core Mac Pro (2 Intel CPUs,
each dual core) running OS X with 2 GB RAM.  I give Java a 1 GB heap
(-Xmx1024m).


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, e.g. how segments
> are merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated), so that it flushes according to RAM usage and not a
> fixed number of added documents.
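The flush-by-RAM-usage policy in the quoted description can be sketched as follows.  This is a toy model, not the patch's implementation: the real writer accounts for buffered postings and recycled buffers, whereas here RAM use is crudely estimated as 2 bytes per buffered char, and "flush" just resets the buffer:

```java
import java.util.ArrayList;
import java.util.List;

public class RamFlushSketch {

    private final long ramBufferSize;            // analogous to setRAMBufferSize
    private long bytesUsed;
    private final List<String> buffered = new ArrayList<>();
    private int flushCount;

    RamFlushSketch(long ramBufferSize) {
        this.ramBufferSize = ramBufferSize;
    }

    // Buffer one doc; flush when the (crude) RAM estimate hits the limit.
    void addDocument(String text) {
        buffered.add(text);
        bytesUsed += 2L * text.length();         // rough: 2 bytes per char
        if (bytesUsed >= ramBufferSize) {
            flush();
        }
    }

    void flush() {
        // The real writer would write/merge buffered postings here; the
        // toy version just drops the buffer and counts the flush.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        RamFlushSketch w = new RamFlushSketch(100);
        for (int i = 0; i < 10; i++) {
            w.addDocument("0123456789");         // ~20 bytes each
        }
        System.out.println(w.flushCount());      // prints 2
    }
}
```

The key contrast with the deprecated setMaxBufferedDocs is visible here: with large docs the buffer fills (and flushes) after few documents, with small docs after many, so RAM stays bounded either way.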

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
