2011/4/29 Michael McCandless <[email protected]>:
> Sorry, but, no :)
>
> So feel free to keep working towards removing this limitation!!
>
> This change makes IndexWriter's flush (where it writes the added
> documents in RAM to disk as a new segment) fully concurrent, so that
> while one segment is being flushed (which could take a longish time,
> eg on a slowish IO system), other threads are now free to continue
> indexing (where they were blocked before). On computers with
> substantial CPU concurrency, and fast "enough" IO systems, this change
> should give a big increase in indexing throughput.
>
> That said, I do think this change is a step towards what you seek
> (allowing multiple IndexWriters, even in separate JVMs maybe on
> separate computers, to write into an index at once).
>
> Mike
Thank you for clarifying this; maybe I don't even need to remove the
locking if I can run some of those participant threads in the remote nodes.
I'll keep you updated, but unfortunately can't start working on it sooner.

Sanne

>
> http://blog.mikemccandless.com
>
> On Fri, Apr 29, 2011 at 2:16 PM, Sanne Grinovero
> <[email protected]> wrote:
>> Hello,
>> this is totally awesome!
>>
>> Does it imply we don't need the IndexWriter lock anymore? And hence
>> that people sharing the Lucene Directory across multiple JVMs can have
>> both write at the same time?
>>
>> I had intentions to *try* removing such limitations this summer, but
>> if this is the case I will spend my time testing this carefully
>> instead, or if some kind of locking is still required I'd appreciate
>> some pointers so that I'll be able to remove them.
>>
>> Regards,
>> Sanne
>>
>> 2011/4/29 Simon Willnauer <[email protected]>:
>>> Hey folks,
>>>
>>> LUCENE-3023 aims to land the considerably large
>>> DocumentsWriterPerThread (DWPT) refactoring on trunk.
>>> During the last weeks we have put a lot of effort into cleaning the
>>> code up, fixing javadocs and running tests locally
>>> as well as on Jenkins. We reached the point where we are able to
>>> create a final patch for review and land this
>>> exciting refactoring on trunk very soon. I committed the CHANGES.TXT
>>> entry (also appended below) a couple of minutes ago so from now on
>>> we freeze the branch for final review (Robert can you create a new
>>> "final" patch and upload to LUCENE-3023).
>>> Any comments should go to [1] or as a reply to this email. If there is
>>> no blocker coming up we plan to reintegrate the
>>> branch and commit it to trunk early next week. For those who want some
>>> background on what DWPT does, read: [2]
>>>
>>> Note: this change will not change the index file format so there is no
>>> need to reindex for trunk users. Yet, I will send a heads up next week
>>> with an overview of what has changed.
>>>
>>> Simon
>>>
>>> [1] https://issues.apache.org/jira/browse/LUCENE-3023
>>> [2] http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/
>>>
>>>
>>> * LUCENE-2956, LUCENE-2573, LUCENE-2324, LUCENE-2555: Changes from
>>>   DocumentsWriterPerThread:
>>>
>>>   - IndexWriter now uses a DocumentsWriter per thread when indexing
>>>     documents. Each DocumentsWriterPerThread indexes documents in its own
>>>     private segment, and the in-memory segments are no longer merged on
>>>     flush. Instead, each segment is separately flushed to disk and
>>>     subsequently merged with normal segment merging.
>>>
>>>   - DocumentsWriterPerThread (DWPT) is now flushed concurrently based on a
>>>     FlushPolicy. When a DWPT is flushed, a fresh DWPT is swapped in so that
>>>     indexing may continue concurrently with flushing. The selected DWPT
>>>     flushes all its RAM-resident documents to disk. Note: Segment flushes
>>>     don't flush all RAM-resident documents but only the documents private
>>>     to the DWPT selected for flushing.
>>>
>>>   - Flushing is now controlled by FlushPolicy that is called for every add,
>>>     update or delete on IndexWriter. By default DWPTs are flushed either on
>>>     maxBufferedDocs per DWPT or the global active used memory. Once the
>>>     active memory exceeds ramBufferSizeMB only the largest DWPT is selected
>>>     for flushing and the memory used by this DWPT is subtracted from the
>>>     active memory and added to a flushing memory pool, which can lead to
>>>     temporarily higher memory usage due to ongoing indexing.
>>>
>>>   - IndexWriter now can utilize ramBufferSize > 2048 MB. Each DWPT can
>>>     address up to 2048 MB memory such that the ramBufferSize is now bounded
>>>     by the max number of DWPT available in the used
>>>     DocumentsWriterPerThreadPool. IndexWriter's net memory consumption can
>>>     grow far beyond the 2048 MB limit if the application can use all
>>>     available DWPTs.
>>>     To prevent a DWPT from exhausting its address space IndexWriter will
>>>     forcefully flush a DWPT if its hard memory limit is exceeded. The
>>>     RAMPerThreadHardLimitMB can be controlled via IndexWriterConfig and
>>>     defaults to 1945 MB.
>>>     Since IndexWriter flushes DWPT concurrently not all memory is released
>>>     immediately. Applications should still use a ramBufferSize significantly
>>>     lower than the JVM's available heap memory since under high load
>>>     multiple flushing DWPT can consume substantial transient memory when IO
>>>     performance is slow relative to indexing rate.
>>>
>>>   - IndexWriter#commit now doesn't block concurrent indexing while flushing
>>>     all 'currently' RAM-resident documents to disk. Yet, flushes that occur
>>>     while a full flush is running are queued and will happen after all DWPT
>>>     involved in the full flush are done flushing. Applications that use
>>>     multiple threads during indexing and trigger a full flush (e.g. call
>>>     commit() or open a new NRT reader) can use significantly more transient
>>>     memory.
>>>
>>>   - IndexWriter#addDocument and IndexWriter#updateDocument can block indexing
>>>     threads if the number of active + number of flushing DWPT exceed a
>>>     safety limit. By default this happens if 2 * max number available thread
>>>     states (DWPTPool) is exceeded. This safety limit prevents applications
>>>     from exhausting their available memory if flushing can't keep up with
>>>     concurrently indexing threads.
>>>
>>>   - IndexWriter only applies and flushes deletes if the maxBufferedDelTerms
>>>     limit is reached during indexing. No segment flushes will be triggered
>>>     due to this setting.
>>>
>>>   - IndexWriter#flush(boolean, boolean) doesn't synchronize on IndexWriter
>>>     anymore. A dedicated flushLock has been introduced to prevent multiple
>>>     full-flushes happening concurrently.
>>>
>>>   - DocumentsWriter doesn't write shared doc stores anymore.
>>>
>>> (Mike McCandless, Michael Busch, Simon Willnauer)
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
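
The flush-by-RAM rule and the stall safety limit described in the CHANGES entry can be sketched as a small, self-contained Java model. This is only an illustration under stated assumptions, not Lucene code: the class FlushByRamSketch, its Dwpt stand-in, and the methods add, flushDone and shouldStall are hypothetical names, and sizes are plain byte counts rather than the MB settings on IndexWriterConfig. It also assumes the caller swaps in a fresh Dwpt once one has been picked for flushing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the flush rules in the CHANGES entry; not Lucene's API. */
final class FlushByRamSketch {

    /** Stand-in for one DocumentsWriterPerThread: an id plus its buffered bytes. */
    static final class Dwpt {
        final int id;
        long bytesUsed;

        Dwpt(int id) {
            this.id = id;
        }
    }

    final long ramBufferBytes;   // plays the role of ramBufferSizeMB (assumption: bytes)
    final int maxThreadStates;   // plays the role of the DWPT pool size
    final List<Dwpt> active = new ArrayList<>();
    final List<Dwpt> flushing = new ArrayList<>();
    long activeBytes;
    long flushingBytes;

    FlushByRamSketch(long ramBufferBytes, int maxThreadStates) {
        this.ramBufferBytes = ramBufferBytes;
        this.maxThreadStates = maxThreadStates;
    }

    /**
     * Accounts for bytes buffered by one DWPT. If active memory then exceeds
     * the RAM buffer, moves the largest active DWPT to the flushing pool and
     * returns it; otherwise returns null.
     */
    Dwpt add(Dwpt dwpt, long bytes) {
        if (!active.contains(dwpt)) {
            active.add(dwpt);
        }
        dwpt.bytesUsed += bytes;
        activeBytes += bytes;
        if (activeBytes <= ramBufferBytes) {
            return null;
        }
        Dwpt largest = active.get(0);
        for (Dwpt candidate : active) {
            if (candidate.bytesUsed > largest.bytesUsed) {
                largest = candidate;
            }
        }
        // Subtract from active memory and add to the flushing pool.
        active.remove(largest);
        flushing.add(largest);
        activeBytes -= largest.bytesUsed;
        flushingBytes += largest.bytesUsed;
        return largest;
    }

    /** Called when a flush has finished; releases that DWPT's memory. */
    void flushDone(Dwpt dwpt) {
        if (flushing.remove(dwpt)) {
            flushingBytes -= dwpt.bytesUsed;
        }
    }

    /** Safety limit from the entry: stall when active + flushing > 2 * pool size. */
    boolean shouldStall() {
        return active.size() + flushing.size() > 2 * maxThreadStates;
    }

    public static void main(String[] args) {
        FlushByRamSketch sketch = new FlushByRamSketch(100, 2);
        Dwpt a = new Dwpt(1);
        Dwpt b = new Dwpt(2);
        sketch.add(a, 40);
        sketch.add(b, 50);
        Dwpt picked = sketch.add(a, 20); // active memory is now above the buffer
        System.out.println("picked DWPT " + picked.id
            + ", active=" + sketch.activeBytes
            + ", flushing=" + sketch.flushingBytes
            + ", stall=" + sketch.shouldStall());
    }
}
```

In the demo, active plus flushing memory adds up to more than the configured buffer right after the largest DWPT is picked, which is the temporarily higher memory usage the entry warns about.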
