[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506974 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

> Do you think your code is easily extensible in this regard? I'm
> wondering because of all the optimizations you're doing like e.g.
> sharing byte arrays. But I'm certainly not familiar enough with your
> code yet, so I'm only guessing here.

Good question! DocumentsWriter is definitely more complex than DocumentWriter, but it doesn't prevent extensibility, and I think it will work very well when we do flexible indexing.

The patch now has dedicated methods for writing into the freq/prox/etc. streams ('writeFreqByte', 'writeFreqVInt', 'writeProxByte', 'writeProxVInt', etc.), but this could easily be changed to instead use true IndexOutput streams. That would hide all details of shared byte arrays from whoever is doing the writing.

The way I roughly see flexible indexing working in the future is that DocumentsWriter will be responsible for keeping track of unique terms seen (in its hash table), holding the Posting instance (which could be subclassed in the future) for each term, flushing a real segment when full, handling shared byte arrays, etc. -- ie, all the "infrastructure". The specific logic of what bytes are written into which streams (freq/prox/vectors/others) would then be handled by a separate class or classes that we can plug/unplug according to some "schema". DocumentsWriter would call on these classes, providing the IndexOutputs for all of the Posting's streams per position, and these classes would write their own format into those IndexOutputs.

I think a separation like that would work well: we could have good performance and also extensibility. The devil is in the details, of course... I obviously haven't factored DocumentsWriter this way (it has its own addPosition that writes the current Lucene index format), but I think this is very doable in the future.
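To make the separation concrete, here is a minimal, self-contained sketch of the pluggable design described above. The names `PostingsConsumer`, `SegmentOutput`, and `DeltaProxConsumer` are illustrative inventions, not Lucene APIs; `SegmentOutput` stands in for IndexOutput and implements Lucene-style VInt encoding (7 bits per byte, low-order first, high bit set on all but the last byte):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class PluggablePostings {

    // Minimal stand-in for Lucene's IndexOutput: an append-only byte
    // stream supporting VInt encoding.
    static class SegmentOutput {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        void writeVInt(int i) {
            // Emit 7 bits at a time, low-order first; high bit marks
            // "more bytes follow".
            while ((i & ~0x7F) != 0) {
                bytes.write((i & 0x7F) | 0x80);
                i >>>= 7;
            }
            bytes.write(i);
        }
    }

    // The pluggable piece: given the streams for a term, write one
    // position. DocumentsWriter would own the streams and buffers;
    // implementations of this interface would own the on-disk format.
    interface PostingsConsumer {
        void addPosition(SegmentOutput freq, SegmentOutput prox,
                         int position, int lastPosition) throws IOException;
    }

    // One possible implementation: delta-encode positions into the prox
    // stream as VInts, roughly in the spirit of Lucene's current format.
    static class DeltaProxConsumer implements PostingsConsumer {
        public void addPosition(SegmentOutput freq, SegmentOutput prox,
                                int position, int lastPosition) {
            prox.writeVInt(position - lastPosition);
        }
    }

    public static void main(String[] args) throws IOException {
        SegmentOutput freq = new SegmentOutput();
        SegmentOutput prox = new SegmentOutput();
        PostingsConsumer consumer = new DeltaProxConsumer();
        consumer.addPosition(freq, prox, 5, 0);     // delta 5 -> one byte
        consumer.addPosition(freq, prox, 305, 5);   // delta 300 -> two bytes
        byte[] out = prox.bytes.toByteArray();
        System.out.println(out.length);             // 3
        System.out.println(out[0] & 0xFF);          // 5
        System.out.println(out[1] & 0xFF);          // 172
        System.out.println(out[2] & 0xFF);          // 2
    }
}
```

The point of the sketch is only the shape of the boundary: the infrastructure hands streams to the format class, and the format class never sees the shared byte arrays behind them.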
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip,
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch,
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
>
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges. Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
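The externally visible change above -- flushing by RAM consumption rather than by a fixed document count -- can be sketched with a toy model. This is not the patch's actual accounting: the class name, the `ramThreshold` parameter (standing in for setRAMBufferSize), and the crude 2-bytes-per-char estimate are all made up for illustration, whereas the real patch tracks shared byte arrays, Posting objects, etc.:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of flushing by estimated RAM usage instead of doc count.
public class RamFlushBuffer {
    private final long ramThreshold;   // analogous to setRAMBufferSize
    private final List<String> buffered = new ArrayList<>();
    private long ramUsed = 0;
    private int flushCount = 0;

    RamFlushBuffer(long ramThreshold) {
        this.ramThreshold = ramThreshold;
    }

    void addDocument(String doc) {
        buffered.add(doc);
        ramUsed += 2L * doc.length();  // crude per-doc RAM estimate
        if (ramUsed >= ramThreshold) {
            flush();
        }
    }

    private void flush() {
        // A real flush would write the buffered postings to disk here
        // (and merge those runs later into a real segment).
        buffered.clear();
        ramUsed = 0;
        flushCount++;
    }

    int flushCount()   { return flushCount; }
    int bufferedDocs() { return buffered.size(); }

    public static void main(String[] args) {
        RamFlushBuffer buf = new RamFlushBuffer(100);
        for (int i = 0; i < 10; i++) {
            buf.addDocument("0123456789");   // ~20 bytes each
        }
        System.out.println(buf.flushCount());   // 2
        System.out.println(buf.bufferedDocs()); // 0
    }
}
```

The benefit over setMaxBufferedDocs is that small and large documents no longer trigger flushes at the same fixed count: the buffer adapts to actual memory pressure.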