[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take6.patch

Attached latest patch. I'm now working towards simplifying & cleaning up the
code & design: eliminated dead code left over from the previous iterations,
used the existing RAMFile instead of my own new class, refactored
duplicate/confusing code, added comments, etc. It's getting closer to a
committable state but still has a ways to go. I also renamed the class from
MultiDocumentWriter to DocumentsWriter.

To summarize the current design:

1. Write stored fields & term vectors to files in the Directory immediately
   (don't buffer these in RAM).

2. Write freq & prox postings to RAM directly as a byte stream instead of a
   first pass as int[] and then a second pass as a byte stream. This single
   pass instead of a double pass is a big savings. I use slices into shared
   byte[] arrays to efficiently allocate bytes to the postings that need
   them.

3. Build a Postings hash that holds the Postings for many documents at once
   instead of a single doc, keyed by unique term. Not tearing down &
   rebuilding the Postings hash with every doc saves a lot of time. Also,
   when term vectors are off this saves a quicksort for every doc, which
   gives a very good performance gain. When the Postings hash is full (has
   used up the allowed RAM) I then create a real Lucene segment when
   autoCommit=true, else a "partial segment".

4. Use my own "partial segment" format that differs from Lucene's normal
   segments in that it is optimized for merging (and unusable for
   searching). This format, and the merger I created to work with it,
   performs merging mostly by copying blocks of bytes instead of
   reinterpreting every vInt in each Postings list. These partial segments
   are only created when IndexWriter has autoCommit=false; on commit they
   are then merged into the real Lucene segment format.

5. Reuse the Posting, PostingVector, char[] and byte[] objects used by the
   Postings hash.
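The shared-array slice scheme in item 2 can be sketched roughly as below. This is my own illustration, not code from the patch: the class name, slice sizes, and 4-byte forward-pointer encoding are all invented for the example. The idea is that many per-term postings streams interleave inside one shared byte[], each stream growing as a chain of progressively larger slices, so no stream ever owns a private array.

```java
import java.util.Arrays;

/** Toy sketch of byte slices carved out of a shared byte[] (illustrative only). */
public class ByteSlicePool {
    // Slice sizes grow by level: short postings stay cheap, long ones
    // amortize the forward-pointer overhead with bigger slices.
    private static final int[] LEVEL_SIZE = {8, 16, 32, 64, 128};

    private byte[] buffer = new byte[1024];
    private int used = 0;

    /** Per-stream write state: where the stream starts and where it writes next. */
    public static class Stream {
        int start;    // offset of the first slice (needed to read back)
        int level;    // size level of the current slice
        int sliceEnd; // one past the current slice's last data byte
        int pos;      // next write position in the shared buffer
        int length;   // total bytes written to this stream
    }

    private int allocate(int size) {
        if (used + size > buffer.length)
            buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, used + size));
        int start = used;
        used += size;
        return start;
    }

    private void writeInt(int offset, int value) {
        for (int i = 0; i < 4; i++)
            buffer[offset + i] = (byte) (value >>> (8 * i));
    }

    private int readInt(int offset) {
        int v = 0;
        for (int i = 0; i < 4; i++)
            v |= (buffer[offset + i] & 0xFF) << (8 * i);
        return v;
    }

    public Stream newStream() {
        Stream s = new Stream();
        s.start = allocate(LEVEL_SIZE[0] + 4); // +4 reserves room for a forward pointer
        s.pos = s.start;
        s.sliceEnd = s.start + LEVEL_SIZE[0];
        return s;
    }

    /** Append one byte, chaining a larger slice when the current one fills up. */
    public void writeByte(Stream s, byte b) {
        if (s.pos == s.sliceEnd) {
            int nextLevel = Math.min(s.level + 1, LEVEL_SIZE.length - 1);
            int next = allocate(LEVEL_SIZE[nextLevel] + 4);
            writeInt(s.sliceEnd, next); // link old slice -> new slice
            s.level = nextLevel;
            s.pos = next;
            s.sliceEnd = next + LEVEL_SIZE[nextLevel];
        }
        buffer[s.pos++] = b;
        s.length++;
    }

    /** Reassemble a stream's bytes by walking its slice chain. */
    public byte[] toBytes(Stream s) {
        byte[] out = new byte[s.length];
        int slice = s.start, level = 0, copied = 0;
        while (copied < s.length) {
            int n = Math.min(LEVEL_SIZE[level], s.length - copied);
            System.arraycopy(buffer, slice, out, copied, n);
            copied += n;
            slice = readInt(slice + LEVEL_SIZE[level]); // follow forward pointer
            level = Math.min(level + 1, LEVEL_SIZE.length - 1);
        }
        return out;
    }
}
```

Lucene's real implementation differs in the details (e.g. the actual patch interleaves freq and prox streams and encodes vInts into the slices), but the allocation pattern is the same: two streams written alternately end up with their slices interleaved in one buffer, and each is read back by following its own chain.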
I plan to keep simplifying the design & implementation. Specifically, I'm
going to test removing #4 above entirely (using my own "partial segment"
format that's optimized for merging, not searching). While doing this may
give back some of the performance gains, that code is the source of much
added complexity in the patch, and it duplicates the current SegmentMerger
code. It was more necessary before (when we would merge thousands of
single-doc segments in memory), but now that each segment contains many
docs I think we are no longer gaining as much performance from it.

I plan instead to write all segments in the "real" Lucene segment format
and use the current SegmentMerger, possibly with some small changes, to do
the merges even when autoCommit=false. Since we have another issue
(LUCENE-856) to optimize segment merging, I can carry over any
optimizations that we may want to keep into that issue. If this doesn't
lose much performance it will make the approach here even simpler.

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
>                      LUCENE-843.take3.patch, LUCENE-843.take4.patch,
>                      LUCENE-843.take5.patch, LUCENE-843.take6.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
>   use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
>   in-RAM merges. Once RAM is full, flush buffers to disk (and
>   merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]