[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507716 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

> Just to clarify your comment on reusing field and doc instances - to my
> understanding reusing a field instance is ok *only* after the containing
> doc was added to the index.

Right, if your documents are very "regular" you should get a sizable
speedup (especially for tiny docs), with or without this patch, if you
make a single Document and add *separate* Field instances to it for
each field, and then reuse the Document and Field instances for all
the docs you want to add.

It's not easy to reuse Field instances now (there's no
setStringValue()). I made a ReusableStringReader to do this, but you
could also make your own class that implements Fieldable.

> For a "fair" comparison I ended up not following most of your
> recommendations, including the reuse field/docs one and the non-compound
> one (apologies:-)), but I might use them later.

OK, when you say "fair" I think you mean that because your previous run
used the compound file format, you had to use it in the run with the
LUCENE-843 patch as well (etc)? The recommendations above should speed
up Lucene with or without my patch.

> For the first 100,000,000 docs (==speller words) the speed-up is quite
> amazing:
> Orig: Speller: added 100000000 words in 10912 seconds = 3 hours 1
> minutes 52 seconds
> New: Speller: added 100000000 words in 58490 seconds = 16 hours 14
> minutes 50 seconds
> This is 5.3 times faster !!!

Wow! I think the speedup might be even greater if both of your runs
followed the suggestions above.

> This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
> I stopped the run now, I don't expect to learn anything new by letting it
> continue.
>
> When trying with MEM=512MB, it at first seemed faster, but then there
> were now and then local slow-downs, and eventually it became a bit slower
> than the previous run. I know these are not merges, so they are either
> flushes (RAM directed), or GC activity. I will perhaps run with GC debug
> flags and perhaps add a print at flush so to tell the culprit for these
> local slow-downs.

Hurm, odd. I haven't pushed the RAM buffer up to 512 MB myself, so it
could be that GC cost somehow makes things worse at that size ...
curious.

> Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs)
> with this patch. The way I indexed it before it took about 4 days -
> running in 4 threads, and creating 36 indexes. This is even more a real
> life scenario, it involves HTML parsing, standard analysis, and merging
> (to some extent). Since there are 4 threads each one will get, say,
> 250MB. Again, for a "fair" comparison, I will remain with compound.

OK, because you're doing StandardAnalyzer and HTML parsing and
presumably loading one-doc-per-file, most of your time is spent
outside of Lucene indexing, so I'd expect less than a 50% speedup in
this case.

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip,
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch,
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
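
On the Field-reuse point above: a rough sketch of what a reusable string
reader could look like, for anyone who wants to try the idea. This is not
the ReusableStringReader from the patch; the class body, the init() method
name and the demo class are all assumptions made up for illustration, and
it uses only java.io so it runs without Lucene. The idea is simply that
one Reader instance is re-pointed at a new string for each document
instead of allocating a new reader (and a new Field around it) every time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical sketch only; not the class from the LUCENE-843 patch. */
final class ReusableStringReader extends Reader {
    private String s = "";
    private int pos = 0;

    /** Point this reader at a new string and rewind (assumed method name). */
    void init(String value) {
        this.s = value;
        this.pos = 0;
    }

    @Override
    public int read(char[] buf, int off, int len) {
        if (len == 0) return 0;
        int left = s.length() - pos;
        if (left <= 0) return -1;              // end of the current string
        int n = Math.min(len, left);
        s.getChars(pos, pos + n, buf, off);    // copy the next n chars
        pos += n;
        return n;
    }

    @Override
    public void close() {
        // Nothing to release; the instance stays usable after close().
    }
}

final class ReusableReaderDemo {
    /** Reads everything the reader currently holds; stands in for an analyzer. */
    static String drain(Reader r) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4];
        try {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One reader instance serves every "document" in the loop.
        ReusableStringReader reader = new ReusableStringReader();
        for (String word : new String[] {"lucene", "speller", "index"}) {
            reader.init(word);
            System.out.println(drain(reader));
        }
    }
}
```

In a real indexing loop the reader would presumably be handed once to a
Field that is built over a Reader, with init() called before each
addDocument(); as noted in the quoted exchange, that is only safe once the
previous document has actually been added.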