[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492655 ]
Marvin Humphrey commented on LUCENE-843:
----------------------------------------

How are you writing the frq data in compressed format? That works fine for prx data, because the deltas are all within a single doc -- but for the freq data, the deltas are tied up in doc num deltas, so you have to decompress it when performing merges.

To continue our discussion from java-dev...

 * I haven't been able to come up with a file format tweak that gets around this doc-num-delta-decompression problem to enhance the speed of frq data merging. I toyed with splitting off the freq from the doc_delta, at the price of increasing the file size in the common case of freq == 1, but went back to the old design. It's not worth the size increase for what's at best a minor indexing speedup.

 * I've added a custom MemoryPool class to KS which grabs memory in 1 meg chunks, allows resizing (downwards) of only the last allocation, and can only release everything at once. From one of these pools, I'm allocating RawPosting objects, each of which is a doc_num, a freq, the term_text, and the pre-packed prx data (which varies based on which Posting subclass created the RawPosting object). I haven't got things 100% stable yet, but preliminary results seem to indicate that this technique, which is a riff on your persistent arrays, improves indexing speed by about 15%.
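To make the frq merging problem above concrete, here is a minimal sketch in Java. It assumes the usual interleaving where the doc delta is shifted left one bit and the low bit flags freq == 1, with everything written as variable-length ints. The class and method names (FrqSketch, encode, decode, merge) are made up for illustration; this is neither Lucene's nor KinoSearch's actual code, and it ignores deletions and skip data.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of doc-delta/freq interleaving; not taken from Lucene or KinoSearch.
final class FrqSketch {

    /** Append a variable-length int: 7 data bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Read one variable-length int starting at pos[0], advancing pos[0] past it. */
    static int readVInt(byte[] data, int[] pos) {
        int b = data[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Encode one term's postings; docs must be ascending and segment-local. */
    static byte[] encode(int[] docs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - lastDoc;
            lastDoc = docs[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1); // low bit set: freq == 1, nothing follows
            } else {
                writeVInt(out, delta << 1);       // low bit clear: freq follows as its own VInt
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    /** Decode into {doc, freq} pairs, adding docBase to every doc number. */
    static List<int[]> decode(byte[] data, int docBase) {
        List<int[]> postings = new ArrayList<>();
        int[] pos = {0};
        int doc = docBase;
        while (pos[0] < data.length) {
            int code = readVInt(data, pos);
            doc += code >>> 1;
            int freq = (code & 1) != 0 ? 1 : readVInt(data, pos);
            postings.add(new int[] {doc, freq});
        }
        return postings;
    }

    /** Merge two segments' postings for one term by decoding both and re-encoding. */
    static byte[] merge(byte[] a, byte[] b, int docBaseOfB) {
        List<int[]> all = decode(a, 0);
        all.addAll(decode(b, docBaseOfB));
        int[] docs = new int[all.size()];
        int[] freqs = new int[all.size()];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = all.get(i)[0];
            freqs[i] = all.get(i)[1];
        }
        return encode(docs, freqs);
    }

    public static void main(String[] args) {
        byte[] a = encode(new int[] {3, 7}, new int[] {1, 2}); // first segment, holds 10 docs
        byte[] b = encode(new int[] {0, 5}, new int[] {1, 1}); // second segment, local doc nums
        for (int[] p : decode(merge(a, b, 10), 0)) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

With these inputs, simply appending the second segment's bytes to the first would decode as docs 3, 7, 7, 12, because the second run's first delta is relative to that segment's own doc 0. The decode-and-re-encode merge gives 3, 7, 10, 15. The prx data doesn't have this problem since position deltas restart with each doc.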
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
>                      LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.