[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507708 ]
Doron Cohen commented on LUCENE-843: ------------------------------------ Just to clarify your comment on reusing field and doc instances - to my understanding reusing a field instance is ok *only* after the containing doc was added to the index. For a "fair" comparison I ended up not following most of your recommendations, including the reuse field/docs one and the non-compound one (apologies:-)), but I might use them later. For the first 100,000,000 docs (==speller words) the speed-up is quite amazing: Orig: Speller: added 100000000 words in 10912 seconds = 3 hours 1 minutes 52 seconds New: Speller: added 100000000 words in 58490 seconds = 16 hours 14 minutes 50 seconds This is 5.3 times faster !!! This btw was with maxBufDocs=100,000 (I forgot to set the MEM param). I stopped the run now, I don't expect to learn anything new by letting it continue. When trying with MEM=512MB, it at first seemed faster, but then there were now and then local slow-downs, and eventually it became a bit slower than the previous run. I know these are not merges, so they are either flushes (RAM directed), or GC activity. I will perhaps run with GC debug flags and perhaps add a print at flush so to tell the culprit for these local slow-downs. Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs) with this patch. The way I indexed it before it took about 4 days - running in 4 threads, and creating 36 indexes. This is even more a real life scenario, it involves HTML parsing, standard analysis, and merging (to some extent). Since there are 4 threads each one will get, say, 250MB. Again, for a "fair" comparison, I will remain with compound. > improve how IndexWriter uses RAM to buffer added documents > ---------------------------------------------------------- > > Key: LUCENE-843 > URL: https://issues.apache.org/jira/browse/LUCENE-843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Affects Versions: 2.2 > Reporter: Michael McCandless > Assignee: Michael McCandless > Priority: Minor > Attachments: index.presharedstores.cfs.zip, > index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, > LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, > LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, > LUCENE-843.take9.patch > > > I'm working on a new class (MultiDocumentWriter) that writes more than > one document directly into a single Lucene segment, more efficiently > than the current approach. > This only affects the creation of an initial segment from added > documents. I haven't changed anything after that, eg how segments are > merged. > The basic ideas are: > * Write stored fields and term vectors directly to disk (don't > use up RAM for these). > * Gather posting lists & term infos in RAM, but periodically do > in-RAM merges. Once RAM is full, flush buffers to disk (and > merge them later when it's time to make a real segment). > * Recycle objects/buffers to reduce time/stress in GC. > * Other various optimizations. > Some of these changes are similar to how KinoSearch builds a segment. > But, I haven't made any changes to Lucene's file format nor added > requirements for a global fields schema. > So far the only externally visible change is a new method > "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is > deprecated) so that it flushes according to RAM usage and not a fixed > number documents added. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]