[ https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2573:
------------------------------------

    Attachment: LUCENE-2573.patch

Here is my first cut / current status on this issue. First of all, I see a 
couple of test failures related to deletes, but they don't seem to be 
(directly) related to this patch since I can reproduce them even without it. 
All of the failures are related to deletes in some way, so I suspect there is 
a separate issue for that, no?

This patch implements a tiered flush strategy combined with a concurrent flush 
approach. 

* All decisions are based on a FlushPolicy which operates on a 
DocumentsWriterSession (which does the RAM tracking and housekeeping). Once 
the flush policy encounters a transition to the next tier, it marks the 
"largest" RAM-consuming thread as flushPending if we transition from a lower 
level, and all threads if we cross the upper water mark (level). The 
DocumentsWriterSession then shifts the memory of a pending thread to a new 
memory "level" (pendingBytes) and marks the thread as pending.

* Once FlushPolicy#findFlushes(..) returns, the calling thread checks if it 
itself needs to flush; if so, it "checks out" its DWPT, replaces it with a 
completely new instance, releases the lock on the ThreadState, and continues 
to flush the "checked-out" DWPT. After that is done, or if the current DWPT 
doesn't need flushing, the indexing thread checks if there are any other 
pending flushes and tries to obtain their locks (non-blocking). It tries each 
lock only once, since if the lock is taken another thread is already holding 
it and will see the flushPending flag once it has finished adding its 
document (see the sketch below).
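
To make the checkout mechanics concrete, here is a minimal sketch of the 
non-blocking flush loop. The names (ThreadState as a ReentrantLock, DWPT, 
checkoutAndReset(), flushIfPending()) are simplified stand-ins for this 
illustration, not the patch's actual API:

{code:java}
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only -- ThreadState, DWPT and checkoutAndReset()
// are simplified stand-ins for the structures described above.
class FlushSketch {
  static class DWPT { /* per-thread in-memory segment */ }

  static class ThreadState extends ReentrantLock {
    volatile boolean flushPending; // set by the FlushPolicy
    DWPT dwpt = new DWPT();

    // Swap in a fresh DWPT and hand back the full one for flushing.
    DWPT checkoutAndReset() {
      DWPT full = dwpt;
      dwpt = new DWPT();
      flushPending = false;
      return full;
    }
  }

  // Called by an indexing thread after FlushPolicy#findFlushes(..),
  // while it still holds the lock on its own ThreadState.
  static void flushIfPending(ThreadState mine, List<ThreadState> all) {
    DWPT toFlush = mine.flushPending ? mine.checkoutAndReset() : null;
    mine.unlock();      // indexing on this ThreadState can resume
    if (toFlush != null) {
      flush(toFlush);   // flush outside the ThreadState lock
    }
    // Help with other pending flushes, but never block: if tryLock()
    // fails, the lock holder will see flushPending once it has
    // finished adding its document.
    for (ThreadState other : all) {
      if (other.flushPending && other.tryLock()) {
        DWPT dwpt = null;
        try {
          if (other.flushPending) { // re-check under the lock
            dwpt = other.checkoutAndReset();
          }
        } finally {
          other.unlock();
        }
        if (dwpt != null) {
          flush(dwpt);
        }
      }
    }
  }

  static void flush(DWPT dwpt) { /* write the segment to disk */ }
}
{code}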


This approach tries to utilize as much concurrency as possible while flushing 
a DWPT by replacing its ThreadState's DWPT with an entirely new instance. 
Yet, this might also cause problems, especially if IO is slow and we fill up 
indexing RAM too fast. To prevent us from bloating up memory too much I 
introduced a notion of "healthiness" which operates on the net bytes used in 
the DocumentsWriterSession (flushBytes + pendingBytes + activeBytes), where 
flushBytes is the memory consumption of currently flushing DWPTs, 
pendingBytes the memory consumption of ThreadStates / DWPTs marked as 
pending, and activeBytes the memory consumption of the actively indexing 
DWPTs. If net bytes reach a certain threshold (currently 2*maxRam) I stall 
incoming threads until the session becomes healthy again.
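
For illustration only, a minimal sketch of that healthiness gate, assuming 
hypothetical counters on the session (the field and method names are 
invented for this example and are not the patch's API):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the "healthiness" stall -- counter and method
// names are invented for this example.
class HealthSketch {
  final AtomicLong flushBytes = new AtomicLong();   // currently flushing DWPTs
  final AtomicLong pendingBytes = new AtomicLong(); // ThreadStates marked pending
  final AtomicLong activeBytes = new AtomicLong();  // actively indexing DWPTs
  final long maxRamBytes;

  HealthSketch(long maxRamBytes) {
    this.maxRamBytes = maxRamBytes;
  }

  long netBytes() {
    return flushBytes.get() + pendingBytes.get() + activeBytes.get();
  }

  boolean isHealthy() {
    return netBytes() < 2 * maxRamBytes; // stall threshold: 2 * maxRam
  }

  // Incoming indexing threads wait here until enough flushes complete.
  synchronized void stallIfUnhealthy() throws InterruptedException {
    while (!isHealthy()) {
      wait();
    }
  }

  // Called after a finished flush has decremented flushBytes.
  synchronized void flushFinished() {
    notifyAll();
  }
}
{code}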

I ran luceneutil with trunk vs. LUCENE-2573, indexing 300k Wikipedia docs 
with a 1GB max RAM buffer and 4 threads. Searches on both indexes yield 
identical results (phew!).
Indexing times in ms look promising:
||trunk||patch|| diff ||
|134129 ms|102932 ms|{color:green}23.25%{color}| 

This patch is still kind of rough and needs more iterations, so reviews and 
questions are very much welcome.




> Tiered flushing of DWPTs by RAM with low/high water marks
> ---------------------------------------------------------
>
>                 Key: LUCENE-2573
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2573
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch, 
> LUCENE-2573.patch
>
>
> Now that we have DocumentsWriterPerThreads we need to track total consumed 
> RAM across all DWPTs.
> A flushing strategy idea that was discussed in LUCENE-2324 was to use a 
> tiered approach:  
> - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
> - Flush all DWPTs at a high water mark (e.g. at 110%)
> - Use linear steps in between the high and low water marks: e.g. when 5 
> DWPTs are used, flush at 90%, 95%, 100%, 105% and 110%.
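> (Equivalently: with N DWPTs and low/high water marks L and H, the i-th 
> flush would trigger at L + i*(H-L)/(N-1) for i = 0..N-1; N=5, L=90% and 
> H=110% give exactly the 5% steps above.)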
> Should we allow the user to configure the low and high water mark values 
> explicitly using total values (e.g. low water mark at 120MB, high water mark 
> at 140MB)?  Or shall we keep for simplicity the single setRAMBufferSizeMB() 
> config method and use something like 90% and 110% for the water marks?
