[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492658 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

> How are you writing the frq data in compressed format? This works fine for
> prx data, because the deltas are all within a single doc -- but for the freq
> data, the deltas are tied up in doc num deltas, so you have to decompress it
> when performing merges.

For each Posting I keep track of the last docID that its term occurred
in; when this differs from the current docID I record the "delta code"
that needs to be written and then I later write it with the final freq
for this document.

> * I haven't been able to come up with a file format tweak that
> gets around this doc-num-delta-decompression problem to enhance the speed
> of frq data merging. I toyed with splitting off the freq from the
> doc_delta, at the price of increasing the file size in the common case of
> freq == 1, but went back to the old design. It's not worth the size
> increase for what's at best a minor indexing speedup.

I'm just doing the "stitching" approach here: it's only the very first
docCode (& freq when freq==1) that must be re-encoded on merging. The
one catch is you must store the last docID of the previous segment so
you can compute the new delta at the boundary. Then I do a raw
"copyBytes" for the remainder of the freq postings.

Note that I'm only doing this for the "internal" merges (of partial
RAM segments and flushed partial segments) I do before creating a real
Lucene segment. I haven't changed how the "normal" Lucene segment
merging works (though I think we should look into it -- I opened a
separate issue): it still re-interprets and then re-encodes all
docID/freq's.

> * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
> chunks, allows resizing (downwards) of only the last allocation, and can
> only release everything at once.
> From one of these pools, I'm allocating
> RawPosting objects, each of which is a doc_num, a freq, the term_text, and
> the pre-packed prx data (which varies based on which Posting subclass
> created the RawPosting object). I haven't got things 100% stable yet, but
> preliminary results seem to indicate that this technique, which is a riff
> on your persistent arrays, improves indexing speed by about 15%.

Fabulous!!

I think it's the custom memory management I'm doing with slices into
shared byte[] arrays for the postings that made the persistent hash
approach work well, this time around (when I had previously tried this
it was slower).

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
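
For readers following the frq discussion, the "stitching" merge described in the comment can be sketched roughly as below. This is only an illustration, not code from the attached patches. The class and method names (FreqStitchSketch, encode, stitch) are made up for this example. It assumes the usual .frq convention (a VInt docCode equal to the doc delta shifted left by one, with the low bit set when freq == 1, otherwise followed by a VInt freq), and it further assumes that each partial segment encodes its first docCode relative to 0.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch; names are illustrative and not taken from the patch.
final class FreqStitchSketch {

    /** Writes v as a variable-length int: 7 bits per byte, low bits first. */
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    /** Encodes ascending docIDs and freqs; the first delta is relative to 0 (assumption). */
    static byte[] encode(int[] docIDs, int[] freqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int i = 0; i < docIDs.length; i++) {
            int delta = docIDs[i] - last;
            last = docIDs[i];
            if (freqs[i] == 1) {
                writeVInt(out, (delta << 1) | 1);   // freq == 1 folded into the low bit
            } else {
                writeVInt(out, delta << 1);
                writeVInt(out, freqs[i]);
            }
        }
        return out.toByteArray();
    }

    /**
     * Appends next to prev for the same term. Only the first docCode of
     * next is decoded and re-encoded against prevLastDocID; every byte
     * after it is copied without being interpreted.
     */
    static byte[] stitch(byte[] prev, int prevLastDocID, byte[] next) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(prev, 0, prev.length);
        if (next.length == 0) {
            return out.toByteArray();
        }
        // Decode the first VInt (the first docCode) of next.
        int pos = 0;
        int b = next[pos++];
        int code = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = next[pos++];
            code |= (b & 0x7F) << shift;
        }
        int firstDocID = code >>> 1;                // delta from 0, i.e. the docID itself
        int newDelta = firstDocID - prevLastDocID;  // delta across the segment boundary
        writeVInt(out, (newDelta << 1) | (code & 1));
        out.write(next, pos, next.length - pos);    // raw copy of the remainder
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = encode(new int[] {3, 7, 300}, new int[] {1, 2, 1});
        byte[] b = encode(new int[] {305, 306}, new int[] {4, 1});
        byte[] whole = encode(new int[] {3, 7, 300, 305, 306}, new int[] {1, 2, 1, 4, 1});
        // Stitching the two parts should give the same bytes as encoding everything at once.
        System.out.println(Arrays.equals(whole, stitch(a, 300, b)));
    }
}
```

The caller has to supply prevLastDocID because it cannot be recovered from the raw bytes without decoding the whole run, which is the "one catch" mentioned in the comment. When freq != 1 the freq VInt that follows the first docCode is untouched and goes along with the raw copy.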