[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take7.patch

Latest working patch attached.

I've cutover to using Lucene's normal segment merging for all merging
(ie, I no longer use a different merge-efficient format for segments
when autoCommit=false); this has substantially simplified the code.

All unit tests pass except disk-full test and certain contrib tests
(gdata-server, lucli, similarity, wordnet) that I think I'm not
causing.

Other changes:

  * Consolidated flushing of a new segment back into IndexWriter
    (previously DocumentsWriter would do its own flushing when
    autoCommit=false).

    I would also like to consolidate merging entirely into
    IndexWriter; right now DocumentsWriter does its own merging of the
    flushed segments when autoCommit=false (this is because those
    segments are "partial" meaning they do not have their own stored
    fields or term vectors). I'm trying to find a clean way to do
    this...

  * Thread concurrency now works: each thread writes into a separate
    Postings hash (up until a limit (currently 5) at which point the
    threads share the Postings hashes) and then when flushing the
    segment I merge the docIDs together. I flush when the total RAM
    used across threads is over the limit. I ran a test comparing
    thread concurrency on current trunk vs this patch, which I'll post
    next.

  * Reduced bytes used per-unique-term to be lower than current
    Lucene. This means the worst-case document (many terms, all of
    which are unique) should use less RAM overall than Lucene trunk
    does.

  * Added some new unit test cases; added missing "writer.close()" to
    one of the contrib tests.

  * Cleanup, comments, etc. I think the code is getting more
    "approachable" now.
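As a rough illustration of the thread-concurrency scheme described above, here is a minimal Java sketch. It is not code from the patch: the class PostingsBufferSketch, its Buffer type, the byte estimate, and all method names are hypothetical, and the real DocumentsWriter is certainly more involved. The sketch only shows the shape of the idea: each thread is assigned its own postings buffer until 5 buffers exist, later threads share existing ones, a flush is triggered when the total estimated RAM across buffers exceeds a limit, and the flush merges the per-buffer docID lists for each term. For brevity everything runs under one coarse lock, so it illustrates the data layout and flush trigger rather than real parallelism.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names come from the LUCENE-843 patch. */
final class PostingsBufferSketch {
    static final int MAX_BUFFERS = 5; // the "currently 5" limit from the comment

    /** One in-RAM buffer: term -> docIDs added by the thread(s) using it. */
    private static final class Buffer {
        final Map<String, List<Integer>> postings = new TreeMap<>();
        long bytes; // crude size estimate, for illustration only
    }

    private final Buffer[] buffers = new Buffer[MAX_BUFFERS];
    private final ThreadLocal<Integer> slot = new ThreadLocal<>();
    private final long ramLimitBytes;
    private int threadsSeen;
    private int nextDocID;
    private int flushCount;

    PostingsBufferSketch(long ramLimitBytes) {
        this.ramLimitBytes = ramLimitBytes;
        for (int i = 0; i < MAX_BUFFERS; i++) buffers[i] = new Buffer();
    }

    /** Buffers one document's terms in the calling thread's buffer. */
    synchronized void addDocument(String... terms) {
        Integer s = slot.get();
        if (s == null) {
            // The first MAX_BUFFERS threads each get their own buffer;
            // later threads wrap around and share an existing one.
            s = threadsSeen++ % MAX_BUFFERS;
            slot.set(s);
        }
        Buffer buf = buffers[s];
        int docID = nextDocID++;
        for (String term : terms) {
            List<Integer> docs = buf.postings.get(term);
            if (docs == null) {
                docs = new ArrayList<>();
                buf.postings.put(term, docs);
                buf.bytes += 2L * term.length();
            }
            // Record each docID at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docID) {
                docs.add(docID);
                buf.bytes += Integer.BYTES;
            }
        }
        // Flush on total RAM across all buffers, not on a document count.
        if (ramUsed() > ramLimitBytes) flush();
    }

    synchronized long ramUsed() {
        long total = 0;
        for (Buffer buf : buffers) total += buf.bytes;
        return total;
    }

    /** Merges the per-buffer docID lists for each term and resets the buffers. */
    synchronized Map<String, List<Integer>> flush() {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Buffer buf : buffers) {
            for (Map.Entry<String, List<Integer>> e : buf.postings.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                      .addAll(e.getValue());
            }
            buf.postings.clear();
            buf.bytes = 0;
        }
        // Sorting stands in for a proper k-way merge of already-sorted lists.
        for (List<Integer> docs : merged.values()) Collections.sort(docs);
        flushCount++;
        return merged;
    }

    synchronized int flushCount() {
        return flushCount;
    }
}
```

Sorting the concatenated lists keeps the sketch short; since each buffer's list for a term is already in increasing docID order, a streaming merge would avoid the extra sort.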
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.