[
https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492658
]
Michael McCandless commented on LUCENE-843:
-------------------------------------------
> How are you writing the frq data in compressed format? This works fine for
> prx data, because the deltas are all within a single doc -- but for the freq
> data, the deltas are tied up in doc num deltas, so you have to decompress it
> when performing merges.
For each Posting I keep track of the last docID that its term occurred
in; when this differs from the current docID I record the "delta code"
that needs to be written and then I later write it with the final freq
for this document.
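
A minimal sketch of that deferred write, for illustration only. It is not taken from the patch: the class and method names (FreqPostingSketch, addOccurrence, finish) are hypothetical, and it assumes the standard Lucene .frq encoding, where the doc delta is shifted left by one, the low bit is set when freq == 1, and otherwise the freq follows as a separate VInt.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration; names are not from the LUCENE-843 patch.
final class FreqPostingSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int lastDocID = -1;   // last docID this term occurred in (-1: none yet)
    private int pendingDocCode;   // (docDelta << 1), recorded when a new doc starts
    private int pendingFreq;      // occurrences seen so far in the pending doc

    // Called once per occurrence of the term, with non-decreasing docIDs.
    void addOccurrence(int docID) {
        if (docID != lastDocID) {
            flushPending();                       // previous doc's freq is now final
            int prev = lastDocID < 0 ? 0 : lastDocID;
            pendingDocCode = (docID - prev) << 1; // record the delta code now
            lastDocID = docID;
            pendingFreq = 1;
        } else {
            pendingFreq++;
        }
    }

    // Writes the last pending entry and returns the encoded bytes.
    byte[] finish() {
        flushPending();
        return out.toByteArray();
    }

    private void flushPending() {
        if (pendingFreq == 0) return;             // nothing pending
        if (pendingFreq == 1) {
            writeVInt(pendingDocCode | 1);        // low bit set: freq == 1 is implied
        } else {
            writeVInt(pendingDocCode);
            writeVInt(pendingFreq);
        }
        pendingFreq = 0;
    }

    // Standard 7-bits-per-byte variable-length int, low-order bytes first.
    private void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }
}
```

With occurrences in docs 3, 3 and 7, this yields the bytes 6, 2, 9: doc 3 with freq 2, then a delta of 4 with the freq == 1 bit set.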
> * I haven't been able to come up with a file format tweak that
> gets around this doc-num-delta-decompression problem to enhance the speed
> of frq data merging. I toyed with splitting off the freq from the
> doc_delta, at the price of increasing the file size in the common case of
> freq == 1, but went back to the old design. It's not worth the size
> increase for what's at best a minor indexing speedup.
I'm just doing the "stitching" approach here: it's only the very first
docCode (& freq when freq==1) that must be re-encoded on merging. The
one catch is you must store the last docID of the previous segment so
you can compute the new delta at the boundary. Then I do a raw
"copyBytes" for the remainder of the freq postings.
Note that I'm only doing this for the "internal" merges (of partial
RAM segments and flushed partial segments) I do before creating a real
Lucene segment. I haven't changed how the "normal" Lucene segment
merging works (though I think we should look into it -- I opened a
separate issue): it still re-interprets and then re-encodes all
docID/freq's.
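
The stitching described above might look roughly like the following sketch. It is an illustration under stated assumptions, not code from the patch: FreqStitchSketch and stitch are hypothetical names, both inputs are assumed to hold the freq postings of the same term in the standard VInt docCode format, the docIDs of both partial segments are assumed to share one numbering space, and the second input is assumed to be non-empty.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration; names are not from the LUCENE-843 patch.
final class FreqStitchSketch {

    // Appends b's freq postings to a's, re-encoding only b's first docCode.
    // lastDocA is the last docID written in a; b must contain at least one entry.
    static byte[] stitch(byte[] a, int lastDocA, byte[] b) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(a, 0, a.length);                // first segment copied as-is

        // Decode only the first VInt of b (its first docCode).
        int pos = 0, shift = 0, code = 0;
        byte x;
        do {
            x = b[pos++];
            code |= (x & 0x7F) << shift;
            shift += 7;
        } while ((x & 0x80) != 0);

        int firstDocB = code >>> 1;               // b's first delta is relative to 0
        int newCode = ((firstDocB - lastDocA) << 1) | (code & 1); // keep freq==1 bit
        writeVInt(out, newCode);

        // Everything after the first docCode is still valid: raw copy.
        out.write(b, pos, b.length - pos);
        return out.toByteArray();
    }

    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }
}
```

Because the freq == 1 flag lives in the low bit of the docCode, re-encoding the docCode carries it along; a separately stored freq (freq > 1) sits in the remainder and is copied untouched.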
> * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
> chunks, allows resizing (downwards) of only the last allocation, and can
> only release everything at once. From one of these pools, I'm allocating
> RawPosting objects, each of which is a doc_num, a freq, the term_text, and
> the pre-packed prx data (which varies based on which Posting subclass
> created the RawPosting object). I haven't got things 100% stable yet, but
> preliminary results seem to indicate that this technique, which is a riff
> on your persistent arrays, improves indexing speed by about 15%.
Fabulous!!
I think it's the custom memory management I'm doing, with slices into
shared byte[] arrays for the postings, that made the persistent hash
approach work well this time around (when I tried it previously, it
was slower).
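
For readers unfamiliar with the pool design quoted above (fixed-size chunks, downward resizing of only the last allocation, release of everything at once), here is a rough Java sketch of the general idea. It is neither the KinoSearch MemoryPool nor the byte[] slicing in the patch: MemoryPoolSketch and all of its members are hypothetical, and keeping the chunks for reuse after releaseAll is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a chunked bump allocator; not KS or Lucene code.
final class MemoryPoolSketch {
    static final int CHUNK_SIZE = 1 << 20;        // 1 meg chunks
    private final List<byte[]> chunks = new ArrayList<>();
    private int chunkIndex = -1;                  // chunk currently being filled
    private int used;                             // bytes used in the current chunk
    private int lastStart = -1;                   // start offset of the last allocation
    private int lastSize;                         // size of the last allocation

    // Reserves size bytes and returns their offset within currentChunk().
    int allocate(int size) {
        if (size < 0 || size > CHUNK_SIZE) {
            throw new IllegalArgumentException("bad size: " + size);
        }
        if (chunkIndex < 0 || used + size > CHUNK_SIZE) {
            chunkIndex++;                         // move on to the next chunk
            if (chunkIndex == chunks.size()) {
                chunks.add(new byte[CHUNK_SIZE]); // grab memory one chunk at a time
            }
            used = 0;
        }
        lastStart = used;
        lastSize = size;
        used += size;
        return lastStart;
    }

    byte[] currentChunk() {
        return chunks.get(chunkIndex);
    }

    // Only the most recent allocation may be resized, and only downwards.
    void shrinkLast(int newSize) {
        if (lastStart < 0 || newSize < 0 || newSize > lastSize) {
            throw new IllegalArgumentException("can only shrink the last allocation");
        }
        used = lastStart + newSize;
        lastSize = newSize;
    }

    // Releases every allocation at once; chunks are kept for reuse.
    void releaseAll() {
        chunkIndex = -1;
        used = 0;
        lastStart = -1;
        lastSize = 0;
    }
}
```

A caller would typically allocate a worst-case size, fill it, then call shrinkLast with the number of bytes actually used, so the next allocation starts right after the real data.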
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges. Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.