Hi Grant,

The benchmarking code I've been using is in all but the first & last patches I attached on LUCENE-843. Really it's just a modified version of the demo IndexFiles code, plus a new analyzer (SimpleSpaceAnalyzer) that is the same as WhitespaceAnalyzer except it re-uses Token/String instead of allocating a new one for each term.

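For readers without the patches at hand, here is a minimal sketch of the token-reuse idea in plain Java. It is not the SimpleSpaceAnalyzer from LUCENE-843 and does not use Lucene's Token or Analyzer classes; ReusableToken and ReusingWhitespaceTokenizer are hypothetical names invented for illustration. The only point it tries to show is that one token object and one char buffer can be refilled for every term instead of allocating per term.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical stand-in for a reusable token: one growable char buffer plus a length. */
final class ReusableToken {
    char[] buffer = new char[16];
    int length;

    void append(char c) {
        if (length == buffer.length) {
            // Grow rarely; the buffer is kept and reused for all later terms.
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        buffer[length++] = c;
    }
}

/** Splits on whitespace, refilling the same ReusableToken on every call. */
final class ReusingWhitespaceTokenizer {
    private final Reader input;
    private final ReusableToken token = new ReusableToken();

    ReusingWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    /**
     * Returns the shared token, or null at end of input.
     * The token's contents are only valid until the next call.
     */
    ReusableToken next() {
        token.length = 0;
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace((char) c)) {
                    if (token.length > 0) {
                        return token; // end of a term
                    }
                    // otherwise skip leading or repeated whitespace
                } else {
                    token.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return token.length > 0 ? token : null;
    }
}
```

The trade-off in a design like this is that callers must copy the characters out before asking for the next term, since the same instance is handed back each time.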
But I'd also like to port these into the benchmark contrib framework. My plan is to make a new DocMaker that knows how to read documents "line by line" from a previously created file, so as not to pay the IO cost of opening a separate file per document, and then make a new class (maybe a task?) that can read documents from a DocMaker and write a single file with one document per line. I just haven't quite gotten to this yet, but I will :)

Mike

"Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> Hi Michael,
>
> I know you've got your hands full, but was wondering if you could
> either post your benchmark code, or better yet, hook it into the
> benchmarker contrib (it is quite easy).
>
> Let me know if I can help,
> Grant
>
> On Jun 21, 2007, at 10:01 AM, Michael McCandless (JIRA) wrote:
>
> >
> > [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907 ]
> >
> > Michael McCandless commented on LUCENE-843:
> > -------------------------------------------
> >
> > OK I ran tests comparing analyzer performance.
> >
> > It's the same test framework as above, using the ~5,500 byte Europarl
> > docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
> > vectors, and CFS=false, indexing 200,000 documents.
> >
> > The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
> > GC cost by not allocating a Term or String for every token in every
> > document.
> >
> > Each result is the best time of 2 runs:
> >
> > ANALYZER              PATCH (sec)   TRUNK (sec)   SPEEDUP
> > SimpleSpaceAnalyzer          79.0         326.5     4.1 X
> > StandardAnalyzer            449.0         674.1     1.5 X
> > WhitespaceAnalyzer          104.0         338.9     3.3 X
> > SimpleAnalyzer              104.7         328.0     3.1 X
> >
> > StandardAnalyzer is definitely rather time consuming!
> >
> >
> >> improve how IndexWriter uses RAM to buffer added documents
> >> ----------------------------------------------------------
> >>
> >>                 Key: LUCENE-843
> >>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
> >>             Project: Lucene - Java
> >>          Issue Type: Improvement
> >>          Components: Index
> >>    Affects Versions: 2.2
> >>            Reporter: Michael McCandless
> >>            Assignee: Michael McCandless
> >>            Priority: Minor
> >>         Attachments: index.presharedstores.cfs.zip,
> >> index.presharedstores.nocfs.zip, LUCENE-843.patch,
> >> LUCENE-843.take2.patch, LUCENE-843.take3.patch,
> >> LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> >> LUCENE-843.take6.patch, LUCENE-843.take7.patch,
> >> LUCENE-843.take8.patch, LUCENE-843.take9.patch
> >>
> >>
> >> I'm working on a new class (MultiDocumentWriter) that writes more
> >> than one document directly into a single Lucene segment, more
> >> efficiently than the current approach.
> >> This only affects the creation of an initial segment from added
> >> documents. I haven't changed anything after that, e.g. how segments
> >> are merged.
> >> The basic ideas are:
> >>   * Write stored fields and term vectors directly to disk (don't
> >>     use up RAM for these).
> >>   * Gather posting lists & term infos in RAM, but periodically do
> >>     in-RAM merges. Once RAM is full, flush buffers to disk (and
> >>     merge them later when it's time to make a real segment).
> >>   * Recycle objects/buffers to reduce time/stress in GC.
> >>   * Other various optimizations.
> >> Some of these changes are similar to how KinoSearch builds a segment.
> >> But, I haven't made any changes to Lucene's file format nor added
> >> requirements for a global fields schema.
> >> So far the only externally visible change is a new method
> >> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> >> deprecated) so that it flushes according to RAM usage and not a fixed
> >> number of documents added.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
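
The one-document-per-line plan from the reply at the top of this message could look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark contrib's DocMaker API and not code from any LUCENE-843 patch: LineFileWriter and LineFileDocSource are hypothetical names, documents are reduced to plain body strings, and embedded line breaks are collapsed to spaces so each document fits on one line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical writer: stores every document body as exactly one line of a single file. */
final class LineFileWriter {
    static void write(Path out, List<String> bodies) {
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String body : bodies) {
                // Collapse embedded line breaks so one line always equals one document.
                w.write(body.replaceAll("[\\r\\n]+", " "));
                w.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Hypothetical reader: hands back one document body per call from a single open file. */
final class LineFileDocSource implements AutoCloseable {
    private final BufferedReader reader;

    LineFileDocSource(Path file) {
        try {
            // The file is opened once, so there is no per-document open cost.
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next document body, or null when the file is exhausted. */
    String nextBody() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a benchmark run, a loop over nextBody() would feed an indexer from one sequential read of the file, which is the IO saving the plan is after. A real version would also need a convention for carrying titles, dates, or other fields on the same line.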