Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless Fri, 22 Jun 2007 13:15:07 -0700

Hi Grant,

The benchmarking code I've been using is in all but the first & last
patches I attached on LUCENE-843.  Really it's just a modified version
of the demo IndexFiles code, plus a new analyzer (SimpleSpaceAnalyzer)
that is the same as WhitespaceAnalyzer except it re-uses Token/String
instead of allocating a new one for each term.
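
To illustrate the re-use idea, here is a minimal, self-contained sketch in
modern Java. It is not the SimpleSpaceAnalyzer from the patches and does not
use the Lucene API; the class names ReusableToken, ReusingWhitespaceTokenizer
and ReuseDemo are hypothetical. It only shows the general technique of
refilling one token object for every term rather than allocating a new
Token/String each time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical mutable token: one instance is refilled for every term.
final class ReusableToken {
    char[] buffer = new char[16];
    int length;

    void append(char c) {
        if (length == buffer.length) {
            // Grow only when a term is longer than any seen so far.
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        buffer[length++] = c;
    }
}

// Hypothetical tokenizer (assumption: not the SimpleSpaceAnalyzer itself).
final class ReusingWhitespaceTokenizer {
    private final Reader input;
    private final ReusableToken token = new ReusableToken();

    ReusingWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    // Returns the shared token holding the next term, or null at end of input.
    ReusableToken next() {
        token.length = 0;
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace((char) c)) {
                    if (token.length > 0) {
                        return token; // whitespace ends the current term
                    }
                    // otherwise: skip leading or repeated whitespace
                } else {
                    token.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return token.length > 0 ? token : null;
    }
}

class ReuseDemo {
    // Copies each term into a String only so the result can be inspected;
    // an indexer would consume the char buffer directly.
    static List<String> tokenize(String text) {
        ReusingWhitespaceTokenizer t =
            new ReusingWhitespaceTokenizer(new StringReader(text));
        List<String> terms = new ArrayList<>();
        ReusableToken tok;
        while ((tok = t.next()) != null) {
            terms.add(new String(tok.buffer, 0, tok.length));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("improve how IndexWriter uses RAM"));
    }
}
```

The caller must finish with a token before asking for the next one, since the
same object is overwritten on every call; that is the trade-off for not
creating garbage per term.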


But, I'd also like to port this code into the benchmark contrib framework.
My plan is to make a new DocMaker that knows how to read documents
"line by line" from a previously created file, to avoid paying the IO
cost of opening a separate file per document, and then make a new class
(maybe a task?) that can read documents from a DocMaker and write a
single file with one document per line.
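
A rough sketch of the one-document-per-line idea, in modern Java with only
the standard library. It does not use the benchmark contrib framework or the
DocMaker interface; LineDocs, writeLineFile, readLineFile and roundTrip are
hypothetical names, and flattening embedded line breaks to spaces is an
assumption about how a document would be made to fit on one line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the benchmark contrib framework.
class LineDocs {
    // Writes each document as one line. Every '\r' or '\n' inside a
    // document is replaced by a space so the document stays on its line.
    static void writeLineFile(List<String> docs, Path out) {
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String doc : docs) {
                w.write(doc.replace('\r', ' ').replace('\n', ' '));
                w.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the documents back: one file open, one document per line.
    static List<String> readLineFile(Path in) {
        List<String> docs = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                docs.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return docs;
    }

    // Writes the documents to a temporary file and reads them back.
    static List<String> roundTrip(List<String> docs) {
        try {
            Path tmp = Files.createTempFile("linedocs", ".txt");
            try {
                writeLineFile(docs, tmp);
                return readLineFile(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("first doc", "second\ndoc")));
    }
}
```

Reading through a single buffered reader means the per-document cost is one
readLine call rather than a file open and close, which is the IO saving the
plan above is after.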

I just haven't quite gotten to this yet, but I will :)

Mike

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> Hi Michael,
> 
> I know you've got your hands full, but was wondering if you could  
> either post your benchmark code, or better yet, hook it into the  
> benchmarker contrib (it is quite easy).
> 
> Let me know if I can help,
> Grant
> 
> On Jun 21, 2007, at 10:01 AM, Michael McCandless (JIRA) wrote:
> 
> >
> > >     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907 ]
> >
> > Michael McCandless commented on LUCENE-843:
> > -------------------------------------------
> >
> > OK I ran tests comparing analyzer performance.
> >
> > It's the same test framework as above, using the ~5,500 byte Europarl
> > docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
> > vectors, and CFS=false, indexing 200,000 documents.
> >
> > The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
> > GC cost by not allocating a Term or String for every token in every
> > document.
> >
> > Each run is best time of 2 runs:
> >
> >   ANALYZER            PATCH (sec) TRUNK (sec)  SPEEDUP
> >   SimpleSpaceAnalyzer  79.0       326.5        4.1 X
> >   StandardAnalyzer    449.0       674.1        1.5 X
> >   WhitespaceAnalyzer  104.0       338.9        3.3 X
> >   SimpleAnalyzer      104.7       328.0        3.1 X
> >
> > > StandardAnalyzer is definitely rather time consuming!
> >
> >
> >> improve how IndexWriter uses RAM to buffer added documents
> >> ----------------------------------------------------------
> >>
> >>                 Key: LUCENE-843
> >>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
> >>             Project: Lucene - Java
> >>          Issue Type: Improvement
> >>          Components: Index
> >>    Affects Versions: 2.2
> >>            Reporter: Michael McCandless
> >>            Assignee: Michael McCandless
> >>            Priority: Minor
> >>         Attachments: index.presharedstores.cfs.zip,  
> >> index.presharedstores.nocfs.zip, LUCENE-843.patch,  
> >> LUCENE-843.take2.patch, LUCENE-843.take3.patch,  
> >> LUCENE-843.take4.patch, LUCENE-843.take5.patch,  
> >> LUCENE-843.take6.patch, LUCENE-843.take7.patch,  
> >> LUCENE-843.take8.patch, LUCENE-843.take9.patch
> >>
> >>
> > >> I'm working on a new class (MultiDocumentWriter) that writes more than
> > >> one document directly into a single Lucene segment, more efficiently
> > >> than the current approach.
> > >> This only affects the creation of an initial segment from added
> > >> documents.  I haven't changed anything after that, eg how segments are
> > >> merged.
> >> The basic ideas are:
> >>   * Write stored fields and term vectors directly to disk (don't
> >>     use up RAM for these).
> >>   * Gather posting lists & term infos in RAM, but periodically do
> >>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
> >>     merge them later when it's time to make a real segment).
> >>   * Recycle objects/buffers to reduce time/stress in GC.
> >>   * Other various optimizations.
> >> Some of these changes are similar to how KinoSearch builds a segment.
> >> But, I haven't made any changes to Lucene's file format nor added
> >> requirements for a global fields schema.
> >> So far the only externally visible change is a new method
> >> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> >> deprecated) so that it flushes according to RAM usage and not a fixed
> > >> number of documents added.
> >
> > -- 
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
> 
> 
