[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507567 ]
Doron Cohen commented on LUCENE-843:
------------------------------------

Mike, I am considering testing the performance of this patch on a somewhat different use case - a real one, I think.

After indexing 25M docs of TREC .gov2 (~500 GB of docs), I used the index terms to build a spell-correction index with the contrib spell checker. Docs here are *very* short - for each index term a document is created, containing some n-grams. The specific machine I used has 2 CPUs, but the SpellChecker indexing does not take advantage of that.

Anyhow, 126,684,685 words==documents were indexed. For the document-adding step I had:
  mergeFactor = 100,000
  maxBufferedDocs = 10,000
so no merging took place. This step took 21 hours and created 12,685 segments, 15-20 GB in total.

Then I optimized the index with mergeFactor = 400 (larger values ran into the open-files limit).
(A sketch of this two-phase setup appears at the end of this message.)

I thought it would be interesting to see how the new code performs in this scenario - what do you think? If you too find this comparison interesting, I have two more questions:
- what settings do you recommend?
- is there any chance of a speed-up in optimize()? I haven't read your new code yet, but at least from some comments here it seems that on-disk merging was not changed... is this (still) so? I would skip the optimize part if it is not of interest for the comparison. (In fact I am still waiting for my optimize() to complete, but if it is not of interest I will just interrupt it...)

Thanks,
Doron

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip, index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than one document directly into a single Lucene segment, more efficiently than the current approach.
> This only affects the creation of an initial segment from added documents. I haven't changed anything after that, e.g. how segments are merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do in-RAM merges. Once RAM is full, flush buffers to disk (and merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment. But, I haven't made any changes to Lucene's file format nor added requirements for a global fields schema.
> So far the only externally visible change is a new method "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is deprecated), so that it flushes according to RAM usage and not a fixed number of documents added.
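For reference, here is a minimal sketch of the two-phase setup Doron describes above, written against the Lucene 2.2 API. The paths and the "body" field name are placeholders, and the term-to-document loop is a simplification of what the contrib SpellChecker's indexDictionary() does internally (the real spell checker also adds start/end/gram fields per word); the mergeFactor and maxBufferedDocs values are the ones quoted in the comment.

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class SpellIndexBuild {
      public static void main(String[] args) throws Exception {
        String mainIndexPath  = "/path/to/gov2-index";   // placeholder path
        String spellIndexPath = "/path/to/spell-index";  // placeholder path

        // Phase 1: add one tiny document per index term, with merging
        // effectively disabled (mergeFactor is larger than the ~12,685
        // segments that get flushed).
        IndexWriter writer = new IndexWriter(spellIndexPath, new WhitespaceAnalyzer(), true);
        writer.setMergeFactor(100000);     // no merges while adding
        writer.setMaxBufferedDocs(10000);  // flush a new segment every 10,000 docs

        IndexReader reader = IndexReader.open(mainIndexPath);
        TermEnum terms = reader.terms(new Term("body", ""));  // "body" is a placeholder field
        try {
          do {
            Term t = terms.term();
            if (t == null || !"body".equals(t.field())) break;
            Document doc = new Document();
            // Simplified: one stored, untokenized word per document.
            doc.add(new Field("word", t.text(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
          } while (terms.next());
        } finally {
          terms.close();
          reader.close();
        }
        writer.close();

        // Phase 2: merge the flushed segments down to one, keeping
        // mergeFactor low enough (400) to stay under the open-files limit.
        writer = new IndexWriter(spellIndexPath, new WhitespaceAnalyzer(), false);
        writer.setMergeFactor(400);
        writer.optimize();
        writer.close();
      }
    }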
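The externally visible change the issue describes is that IndexWriter flushes by RAM usage instead of by a fixed document count. As a rough sketch of how the document-adding phase above would be driven under the patch: the issue text calls the new method "setRAMBufferSize"; the method that eventually shipped in released Lucene versions is setRAMBufferSizeMB(double), taking a size in megabytes, and the 32 MB value below is only an illustrative choice, not a recommendation from the issue.

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class RamBufferedAdd {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/spell-index",  // placeholder path
                                             new WhitespaceAnalyzer(), true);

        // Flush segments when buffered postings reach ~32 MB of RAM,
        // rather than after a fixed number of added documents
        // (setMaxBufferedDocs is deprecated by the patch).
        writer.setRAMBufferSizeMB(32.0);

        // ... addDocument() loop as in the sketch above ...

        writer.close();
      }
    }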