[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take5.patch

I attached a new iteration of the patch. It's quite different from the
last patch.

After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs. This is in contrast to the previous approach, where each
doc made its own segment and the segments were then merged. It turns
out this is even faster than my previous approach, especially for
smaller docs and especially when term vectors are off (because no
quicksort() is needed until the segment is flushed). I will attach new
benchmark results.

Other changes:

  * Changed my benchmarking tool / testing (IndexLineFiles):

    - I turned off the compound file format (to reduce time NOT spent
      on indexing).

    - I noticed I was not downcasing the terms, so I fixed that.

    - I now do my own line processing instead of calling
      BufferedReader.readLine, to reduce GC cost (again, to reduce time
      NOT spent on indexing); there is a sketch of what I mean below.

  * Norms now properly flush to disk in the autoCommit=false case.

  * All unit tests pass except the disk-full tests.

  * I turned on asserts for the unit tests (jvm arg -ea added to the
    junit ant task). I think we should use asserts when running tests;
    I have quite a few asserts now.

With this new approach, as I process each term in the document I
immediately write the prox/freq in their compact (VInt) format into
shared byte[] buffers, rather than accumulating int[] arrays that then
need to be re-processed into the VInt encoding. This speeds things up
because we don't double-process the postings, and it uses less
per-document RAM because the intermediate postings are stored as VInts
rather than as ints.

When enough RAM is used by the Posting entries plus the byte[] buffers,
I flush them to a partial RAM segment. When enough of these RAM
segments have accumulated, I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false),
which are then merged at the end to create a real Lucene segment.
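To illustrate that line-processing change: this is only a rough sketch
(not code from the patch; the class name and buffer size are made up)
of reading lines into reused buffers, so the benchmark does not
allocate a new String per line the way BufferedReader.readLine does:

    import java.io.IOException;
    import java.io.Reader;

    // Sketch only: reuse one char[] and one StringBuilder across all
    // lines instead of allocating a String per line.
    public class ReusableLineReader {
      private final Reader in;
      private final char[] buffer = new char[64*1024];
      private final StringBuilder line = new StringBuilder();
      private int pos, limit;

      public ReusableLineReader(Reader in) {
        this.in = in;
      }

      // Returns the next line in a reused StringBuilder, or null at EOF.
      public StringBuilder readLine() throws IOException {
        line.setLength(0);
        while (true) {
          if (pos >= limit) {
            limit = in.read(buffer);
            pos = 0;
            if (limit <= 0)
              return line.length() > 0 ? line : null;
          }
          while (pos < limit) {
            char c = buffer[pos++];
            if (c == '\n')
              return line;
            if (c != '\r')
              line.append(c);
          }
        }
      }
    }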
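Here is a minimal sketch of the "write postings as VInts directly into
a shared byte[] pool" idea described above. It is not the patch's
actual class (the names and the doubling growth policy are simplified
for illustration), but the encoding matches Lucene's VInt format:

    // Sketch only: a growable shared byte pool that postings are
    // appended to in their final compact VInt form as each term
    // occurrence is processed.
    class BytePool {
      byte[] bytes = new byte[4096];
      int upto;                          // next free position in the pool

      // Append one value in VInt format (7 bits per byte, high bit
      // means "more bytes follow"), the same layout writeVInt uses.
      void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
          writeByte((byte) ((i & 0x7F) | 0x80));
          i >>>= 7;
        }
        writeByte((byte) i);
      }

      void writeByte(byte b) {
        if (upto == bytes.length) {      // grow the shared buffer as needed
          byte[] newBytes = new byte[2*bytes.length];
          System.arraycopy(bytes, 0, newBytes, 0, bytes.length);
          bytes = newBytes;
        }
        bytes[upto++] = b;
      }
    }

So for each occurrence of a term, something like
pool.writeVInt(position - lastPosition) appends the prox delta in its
final form, and nothing has to be re-encoded when the segment is
flushed.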
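And a rough sketch of the two-level flush described in the last two
paragraphs. All of the names and thresholds here are assumptions for
illustration, not the patch's actual fields or methods:

    // Sketch only: flush buffered postings to a partial RAM segment
    // when the RAM budget is hit, and write those partial segments
    // out once enough of them have accumulated.
    class FlushPolicySketch {
      long ramBufferSize = 16*1024*1024; // RAM budget for buffered postings
      int maxRAMSegments = 10;           // partial RAM segments before writing out
      boolean autoCommit = true;

      long postingsBytesUsed;            // Posting entries + shared byte[] pools
      int numRAMSegments;

      void maybeFlush() {
        if (postingsBytesUsed >= ramBufferSize) {
          // Sort the postings hash and write a partial segment kept in RAM.
          flushPartialRAMSegment();
          numRAMSegments++;
          postingsBytesUsed = 0;
        }
        if (numRAMSegments >= maxRAMSegments) {
          if (autoCommit) {
            // Merge the partial RAM segments into a real Lucene segment.
            flushRealSegment();
          } else {
            // Write partial segments to disk; they are merged into a
            // real segment only at the end.
            flushPartialSegmentsToDisk();
          }
          numRAMSegments = 0;
        }
      }

      void flushPartialRAMSegment() { /* omitted */ }
      void flushRealSegment() { /* omitted */ }
      void flushPartialSegmentsToDisk() { /* omitted */ }
    }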
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments
> are merged.
>
> The basic ideas are:
>
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>
>   * Recycle objects/buffers to reduce time/stress in GC.
>
>   * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format, nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method,
> "setRAMBufferSize", in IndexWriter (setMaxBufferedDocs is
> deprecated), so that flushing happens according to RAM usage and not
> after a fixed number of added documents.
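For reference, a minimal usage sketch of that externally visible
change. The method name (setRAMBufferSize) is taken from the
description above; the byte units, the 32 MB figure and the rest of
the setup are assumptions for illustration and may not match the
final patch:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class RAMBufferUsageSketch {
      public static void main(String[] args) throws Exception {
        // Open a writer as in Lucene 2.2; the index path is a placeholder.
        IndexWriter writer = new IndexWriter("/tmp/testindex",
                                             new StandardAnalyzer(), true);

        // Flush by RAM usage rather than a fixed document count
        // (setMaxBufferedDocs is deprecated by the patch).
        // Assumed units: bytes.
        writer.setRAMBufferSize(32*1024*1024);

        // ... addDocument() calls here flush whenever the buffered
        // postings exceed the RAM budget ...

        writer.close();
      }
    }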