[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506811 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

> Does DocumentsWriter also solve the problem DocumentWriter had
> before LUCENE-880? I believe the answer is yes. Even though you
> close the TokenStreams in the finally clause of invertField() like
> DocumentWriter did before 880, this is safe, because addPosition()
> serializes the term strings and payload bytes into the posting hash
> table right away. Is that right?

That's right. When I merged in the fix for LUCENE-880, I realized that with this patch it's fine to close the token stream immediately after processing all of its tokens, because everything about the token stream has been "absorbed" into the postings hash.

> The benchmarks you run focus on measuring the pure indexing
> performance. I think it would be interesting to know how big the
> speedup is in real-life scenarios, i.e. with StandardAnalyzer and
> maybe even HTML parsing? For sure the speedup will be less, but it
> should still be a significant improvement. Did you run those kinds
> of benchmarks already?

Good question ... I haven't measured the performance cost of using StandardAnalyzer or HTML parsing, but I will test & post back.
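The point above can be illustrated with a minimal sketch. This is not Lucene's actual DocumentsWriter; the class and its helpers are hypothetical, and a plain map stands in for the real postings hash. The key idea is that addPosition() copies the term text into the postings structure immediately, so nothing retains a reference to the token stream and it can be closed in the finally clause as soon as its tokens are consumed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified illustration (hypothetical, not Lucene's real code):
// a postings "hash" that absorbs each term as it is seen.
public class PostingsHashSketch {
    // term text -> positions at which the term occurred
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void addPosition(String termText, int position) {
        // The term's characters are copied into the map key right away;
        // after this call, the token and its stream are no longer needed.
        postings.computeIfAbsent(termText, t -> new ArrayList<>()).add(position);
    }

    // A plain Iterator<String> stands in for a real TokenStream here.
    void invertField(Iterator<String> tokenStream) {
        try {
            int position = 0;
            while (tokenStream.hasNext()) {
                addPosition(tokenStream.next(), position++);
            }
        } finally {
            // Safe to close the stream at this point: everything about it
            // has already been absorbed into the postings hash.
        }
    }

    List<Integer> positions(String term) {
        return postings.get(term);
    }

    public static void main(String[] args) {
        PostingsHashSketch w = new PostingsHashSketch();
        w.invertField(Arrays.asList("fast", "indexing", "fast").iterator());
        System.out.println(w.positions("fast")); // occurrences at positions 0 and 2
    }
}
```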
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip,
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch,
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
>
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (setMaxBufferedDocs is now
> deprecated), so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
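The flush-by-RAM policy the issue describes can be sketched as follows. This is a hypothetical, simplified illustration, not the patch's actual code: the class name, the byte-accounting, and flush() are all stand-ins. The point is the contrast with setMaxBufferedDocs — the writer tracks approximate bytes used by buffered documents and flushes when the RAM budget is exceeded, rather than after a fixed document count:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a RAM-based flush policy (cf. setRAMBufferSize);
// not Lucene's real IndexWriter internals.
public class RamFlushSketch {
    private final long ramBufferSize;          // byte budget for buffered docs
    private long bytesUsed = 0;
    private final List<String> buffered = new ArrayList<>();
    int flushCount = 0;                        // how many times we hit the budget

    RamFlushSketch(long ramBufferSize) {
        this.ramBufferSize = ramBufferSize;
    }

    void addDocument(String doc) {
        buffered.add(doc);
        bytesUsed += 2L * doc.length();        // rough estimate: 2 bytes per char
        if (bytesUsed >= ramBufferSize) {
            flush();
        }
    }

    private void flush() {
        // The real patch writes buffered postings to disk and recycles its
        // buffers here; this sketch just resets the counters.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }
}
```

With a 20-byte budget, every two 5-character documents trigger a flush, regardless of how many documents have been added — whereas setMaxBufferedDocs would flush on a fixed count even if the documents were tiny or huge.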