Re: Code Freeze on realtime_search branch

Sanne Grinovero Fri, 29 Apr 2011 12:45:17 -0700

2011/4/29 Michael McCandless <[email protected]>:
> Sorry, but, no :)
>
> So feel free to keep working towards removing this limitation!!
>
> This change makes IndexWriter's flush (where it writes the added
> documents in RAM to disk as a new segment) fully concurrent, so that
> while one segment is being flushed (which could take a longish time,
> eg on a slowish IO system), other threads are now free to continue
> indexing (where they were blocked before).  On computers with
> substantial CPU concurrency, and fast "enough" IO systems, this change
> should give a big increase in indexing throughput.
>
> That said, I do think this change is a step towards what you seek
> (allowing multiple IndexWriters, even in separate JVMs maybe on
> separate computers, to write into an index at once).
>
> Mike


Thank you for clarifying this; maybe I don't even need to remove the
locking if I can run some of those participant threads in the remote
nodes.
I'll keep you updated, but unfortunately can't start working on it sooner.

Sanne


>
> http://blog.mikemccandless.com
>
> On Fri, Apr 29, 2011 at 2:16 PM, Sanne Grinovero
> <[email protected]> wrote:
>> Hello,
>> this is totally awesome!
>>
>> Does it imply we don't need the IndexWriter lock anymore? And hence
>> that people sharing the Lucene Directory across multiple JVMs can have
>> both write at the same time?
>>
>> I had intentions to *try* removing such limitations this summer, but
>> if this is the case I will spend my time testing this carefully
>> instead, or if some kind of locking is still required I'd appreciate
>> some pointers so that I'll be able to remove them.
>>
>> Regards,
>> Sanne
>>
>> 2011/4/29 Simon Willnauer <[email protected]>:
>>> Hey folks,
>>>
>>> LUCENE-3023 aims to land the considerably large
>>> DocumentsWriterPerThread (DWPT) refactoring on trunk.
>>> During the last weeks we have put a lot of effort into cleaning the
>>> code up, fixing javadocs and running tests locally
>>> as well as on Jenkins. We reached the point where we are able to
>>> create a final patch for review and land this
>>> exciting refactoring on trunk very soon. I committed the CHANGES.TXT
>>> entry (also appended below) a couple of minutes ago so from now on
>>> we freeze the branch for final review (Robert can you create a new
>>> "final" patch and upload to LUCENE-3023).
>>> Any comments should go to [1] or as a reply to this email. If there is
>>> no blocker coming up we plan to reintegrate the
>>> branch and commit it to trunk early next week. For those who want some
>>> background on what DWPT does, read [2].
>>>
>>> Note: this change will not change the index file format so there is no
>>> need to reindex for trunk users. Yet, I will send a heads up next week
>>> with an overview of what has changed.
>>>
>>> Simon
>>>
>>> [1] https://issues.apache.org/jira/browse/LUCENE-3023
>>> [2] http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/
>>>
>>>
>>> * LUCENE-2956, LUCENE-2573, LUCENE-2324, LUCENE-2555: Changes from
>>>  DocumentsWriterPerThread:
>>>
>>>  - IndexWriter now uses a DocumentsWriter per thread when indexing
>>>    documents. Each DocumentsWriterPerThread indexes documents in its own
>>>    private segment, and the in memory segments are no longer merged on
>>>    flush.  Instead, each segment is separately flushed to disk and
>>>    subsequently merged with normal segment merging.
>>>
>>>  - DocumentsWriterPerThread (DWPT) is now flushed concurrently based on a
>>>    FlushPolicy.  When a DWPT is flushed, a fresh DWPT is swapped in so that
>>>    indexing may continue concurrently with flushing.  The selected
>>>    DWPT flushes all its RAM resident documents to disk.  Note: Segment
>>>    flushes don't flush all RAM resident documents but only the documents
>>>    private to the DWPT selected for flushing.
>>>
>>>  - Flushing is now controlled by FlushPolicy that is called for every add,
>>>    update or delete on IndexWriter. By default DWPTs are flushed either on
>>>    maxBufferedDocs per DWPT or the global active used memory. Once the
>>>    active memory exceeds ramBufferSizeMB only the largest DWPT is selected
>>>    for flushing and the memory used by this DWPT is subtracted from the
>>>    active memory and added to a flushing memory pool, which can lead to
>>>    temporarily higher memory usage due to ongoing indexing.
>>>
>>>  - IndexWriter now can utilize ramBufferSize > 2048 MB. Each DWPT can
>>>    address up to 2048 MB memory such that the ramBufferSize is now bounded
>>>    by the max number of DWPT available in the used
>>>    DocumentsWriterPerThreadPool. IndexWriter's net memory consumption can
>>>    grow far beyond the 2048 MB limit if the application can use all
>>>    available DWPTs. To prevent a DWPT from exhausting its address space
>>>    IndexWriter will forcefully flush a DWPT if its hard memory limit is
>>>    exceeded. The RAMPerThreadHardLimitMB can be controlled via
>>>    IndexWriterConfig and defaults to 1945 MB.
>>>    Since IndexWriter flushes DWPT concurrently not all memory is released
>>>    immediately. Applications should still use a ramBufferSize significantly
>>>    lower than the JVM's available heap memory since under high load multiple
>>>    flushing DWPT can consume substantial transient memory when IO
>>>    performance is slow relative to indexing rate.
>>>
>>>  - IndexWriter#commit now doesn't block concurrent indexing while flushing
>>>    all 'currently' RAM resident documents to disk. Yet, flushes that occur
>>>    while a full flush is running are queued and will happen after all DWPT
>>>    involved in the full flush are done flushing. Applications that use
>>>    multiple threads during indexing and trigger a full flush (e.g. call
>>>    commit() or open a new NRT reader) can use significantly more transient
>>>    memory.
>>>
>>>  - IndexWriter#addDocument and IndexWriter#updateDocument can block indexing
>>>    threads if the number of active + number of flushing DWPT exceeds a
>>>    safety limit. By default this happens if 2 * the max number of available
>>>    thread states (DWPTPool) is exceeded. This safety limit prevents
>>>    applications from exhausting their available memory if flushing can't
>>>    keep up with concurrently indexing threads.
>>>
>>>  - IndexWriter only applies and flushes deletes if the maxBufferedDelTerms
>>>    limit is reached during indexing. No segment flushes will be triggered
>>>    due to this setting.
>>>
>>>  - IndexWriter#flush(boolean, boolean) doesn't synchronize on IndexWriter
>>>    anymore. A dedicated flushLock has been introduced to prevent multiple
>>>    full flushes happening concurrently.
>>>
>>>  - DocumentsWriter doesn't write shared doc stores anymore.
>>>
>>>  (Mike McCandless, Michael Busch, Simon Willnauer)
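
The flush-by-RAM rule described in the entry above (once active memory exceeds ramBufferSizeMB, only the largest DWPT is selected and its memory moves from the active pool to a flushing pool) can be illustrated with a small toy model. This is only a sketch under stated assumptions: FlushByRamSketch, Dwpt, onAdd and onFlushDone are hypothetical names invented for illustration, not Lucene classes or methods, and sizes are plain byte counts rather than real buffer accounting.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the flush-by-RAM rule; none of these names are Lucene classes. */
final class FlushByRamSketch {

    /** Hypothetical stand-in for one DWPT: it only tracks its buffered bytes. */
    static final class Dwpt {
        final String name;
        long bytesUsed;
        boolean flushPending;
        Dwpt(String name) { this.name = name; }
    }

    final long ramBufferBytes;                 // plays the role of ramBufferSizeMB
    final List<Dwpt> dwpts = new ArrayList<>();
    long activeBytes;                          // held by DWPTs that are still indexing
    long flushingBytes;                        // held by DWPTs selected for flushing

    FlushByRamSketch(long ramBufferBytes) { this.ramBufferBytes = ramBufferBytes; }

    Dwpt newDwpt(String name) {
        Dwpt d = new Dwpt(name);
        dwpts.add(d);
        return d;
    }

    /** Called after every add; returns the DWPT selected for flushing, or null. */
    Dwpt onAdd(Dwpt dwpt, long docBytes) {
        if (dwpt.flushPending) {
            throw new IllegalStateException("a fresh DWPT should have been swapped in");
        }
        dwpt.bytesUsed += docBytes;
        activeBytes += docBytes;
        if (activeBytes <= ramBufferBytes) {
            return null;                       // still under the global limit
        }
        Dwpt largest = null;                   // pick only the largest active DWPT
        for (Dwpt d : dwpts) {
            if (!d.flushPending && (largest == null || d.bytesUsed > largest.bytesUsed)) {
                largest = d;
            }
        }
        largest.flushPending = true;
        activeBytes -= largest.bytesUsed;      // subtracted from active memory ...
        flushingBytes += largest.bytesUsed;    // ... and added to the flushing pool
        return largest;
    }

    /** Called once the selected DWPT has been written out as a segment. */
    void onFlushDone(Dwpt d) {
        flushingBytes -= d.bytesUsed;
        dwpts.remove(d);
    }

    public static void main(String[] args) {
        FlushByRamSketch policy = new FlushByRamSketch(100);
        Dwpt a = policy.newDwpt("a");
        Dwpt b = policy.newDwpt("b");
        policy.onAdd(a, 40);
        policy.onAdd(b, 50);
        Dwpt selected = policy.onAdd(a, 20);   // pushes active memory over the limit
        System.out.println("selected=" + selected.name
                + " active=" + policy.activeBytes
                + " flushing=" + policy.flushingBytes);
        policy.onFlushDone(selected);
    }
}
```

In this model the sum of active and flushing bytes can sit above the configured limit until onFlushDone runs, which is the "temporarily higher memory usage" the entry mentions.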
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
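
The safety limit from the CHANGES entry (indexing threads block once active plus flushing DWPTs exceed 2 * the max number of thread states) can be sketched with a plain Java monitor. This is a simplified illustration, not Lucene's implementation: StallControlSketch and its methods are hypothetical names, and it assumes every thread state always holds one active DWPT.

```java
/** Toy model of the stall rule; the class and method names are hypothetical. */
final class StallControlSketch {
    private final int maxThreadStates;
    private final int activeDwpts;   // assumption: one active DWPT per thread state
    private int flushingDwpts;

    StallControlSketch(int maxThreadStates) {
        this.maxThreadStates = maxThreadStates;
        this.activeDwpts = maxThreadStates;
    }

    /** True while active + flushing DWPTs exceed 2 * max thread states. */
    synchronized boolean isStalled() {
        return activeDwpts + flushingDwpts > 2 * maxThreadStates;
    }

    /** An active DWPT is handed to a flush and a fresh one is swapped in. */
    synchronized void startFlush() {
        flushingDwpts++;
    }

    /** A flush finished; wake any indexing threads that were blocked. */
    synchronized void finishFlush() {
        flushingDwpts--;
        notifyAll();
    }

    /** What an indexing thread would call before adding a document. */
    synchronized void waitIfStalled() throws InterruptedException {
        while (isStalled()) {
            wait();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StallControlSketch control = new StallControlSketch(2); // limit is 2 * 2
        control.startFlush();
        control.startFlush();
        System.out.println("stalled with 2 flushing: " + control.isStalled());
        control.startFlush();
        System.out.println("stalled with 3 flushing: " + control.isStalled());

        Thread indexer = new Thread(() -> {
            try {
                control.waitIfStalled();       // blocks until a flush completes
                System.out.println("indexer resumed");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();
        control.finishFlush();                 // brings the count back to the limit
        indexer.join();
    }
}
```

The point of the design is back-pressure: when IO cannot keep up, indexing threads wait instead of allocating ever more in-memory segments.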