[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507587 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

> I thought it would be interesting to see how the new code performs in this
> scenario, what do you think?

Yes, I'd be very interested to see the results of this. It's a somewhat
"unusual" indexing situation (such tiny docs), but it's a real-world test
case. Thanks!

> - what settings do you recommend?

I think these are likely the important ones in this case:

* Flush by RAM instead of doc count (writer.setRAMBufferSizeMB(...)).

* Give it as much RAM as you can.

* Use maybe 3 indexing threads (if you can).

* Turn off compound file.

* If you have stored fields/vectors (seems not in this case), use
  autoCommit=false.

* Use a trivial analyzer that doesn't create new String/new Token
  (re-use the same Token, and use the char[]-based term text storage
  instead of the String one).

* Re-use Document/Field instances. The DocumentsWriter is fine with this,
  and it saves substantial time from GC, especially because your docs are
  so tiny (per-doc overhead is otherwise a killer). In IndexLineFiles I
  made a StringReader that lets me reset its String value; this way I
  didn't have to change the Field instances stored in the Document.

> - is there any chance for speed-up in optimize()? I didn't read
> your new code yet, but at least from some comments here it seems
> that on disk merging was not changed... is this (still) so? I would

Correct: my patch doesn't touch merging or optimizing. All it does now is
gather many docs in RAM and then flush a new segment when it's time. I've
opened a separate issue (LUCENE-856) for optimizations in segment merging.
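The re-use trick mentioned above (a StringReader whose String value can be
reset, so the same Field/Document instances survive across documents) can be
sketched as below. The class name and details are illustrative, not the
actual IndexLineFiles code:

```java
import java.io.Reader;

// A Reader over a String whose value can be swapped in place, so the
// same Field instance (constructed once around this Reader) can be
// re-used for every document instead of allocating new objects per doc.
final class ReusableStringReader extends Reader {
  private String s;
  private int pos;

  // Point this Reader at the next document's text and rewind.
  public void reset(String s) {
    this.s = s;
    this.pos = 0;
  }

  @Override
  public int read(char[] buf, int off, int len) {
    if (pos >= s.length()) {
      return -1;                       // end of the current value
    }
    int n = Math.min(len, s.length() - pos);
    s.getChars(pos, pos + n, buf, off);
    pos += n;
    return n;
  }

  @Override
  public void close() {}
}
```

The indexing loop then calls reset(line) once per input line and re-adds the
same Document, avoiding per-document garbage.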
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip,
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch,
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
>
> The basic ideas are:
>
> * Write stored fields and term vectors directly to disk (don't
>   use up RAM for these).
>
> * Gather posting lists & term infos in RAM, but periodically do
>   in-RAM merges. Once RAM is full, flush buffers to disk (and
>   merge them later when it's time to make a real segment).
>
> * Recycle objects/buffers to reduce time/stress in GC.
>
> * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format, nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated), so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
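A minimal sketch of the flush-by-RAM setup this issue introduces, combined
with the related settings recommended in the comment above, against the
post-LUCENE-843 IndexWriter API. The index path, analyzer choice, and buffer
size are illustrative, and this configuration fragment is untested here:

```java
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// autoCommit=false defers making flushed segments visible to readers
// until close(), which is cheaper when stored fields/vectors are used.
IndexWriter writer = new IndexWriter(
    FSDirectory.getDirectory("/path/to/index"),  // illustrative path
    false,                                       // autoCommit
    new WhitespaceAnalyzer());                   // a trivial analyzer

writer.setRAMBufferSizeMB(64.0);    // flush by RAM usage, not doc count
writer.setUseCompoundFile(false);   // skip compound-file packing
```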
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]