[
https://issues.apache.org/jira/browse/LUCENE-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Simon Willnauer updated LUCENE-2573:
------------------------------------
Attachment: LUCENE-2573.patch
Here is my first cut / current status on this issue. First of all, I have a
couple of test failures, all of them related to deletes in some way. They do
not seem to be (directly) related to this patch, since I can reproduce them
even without it, so I suspect that there is another issue for that, no?
This patch implements a tiered flush strategy combined with a concurrent flush
approach.
* All decisions are based on a FlushPolicy which operates on a
DocumentsWriterSession (which does the RAM tracking and housekeeping). Once the
flush policy encounters a transition to the next tier, it marks the thread
consuming the most RAM as flushPending if we transition from a lower level, and
all threads if we transition from the upper watermark (level).
DocumentsWriterSession shifts the memory of a pending thread to a new memory
"level" (pendingBytes) and marks the thread as pending.
* Once FlushPolicy#findFlushes(..) returns, the caller checks whether it needs
to flush itself. If so, it "checks out" its DWPT, replaces it with a completely
new instance, releases the lock on the ThreadState and goes on to flush the
"checked-out" DWPT. After this is done, or if the current DWPT doesn't need
flushing, the indexing thread checks whether there are any other pending
flushes and tries to obtain their locks (non-blocking). It tries each lock only
once: if the lock is taken, another thread is already holding it and will see
the flushPending flag once it has finished adding its document.
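To illustrate the check-out and "try once" steps described above, here is a minimal sketch. It is not code from the patch: Dwpt, ThreadState, FlushSketch and all of their fields and methods are hypothetical stand-ins, and the actual flush is replaced by recording the DWPT in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a DocumentsWriterPerThread.
final class Dwpt {
    long bytesUsed;
}

// Hypothetical stand-in for a ThreadState: a lock plus the DWPT it guards.
final class ThreadState {
    final ReentrantLock lock = new ReentrantLock();
    Dwpt dwpt = new Dwpt();
    boolean flushPending; // set by the flush policy, read and written under 'lock'
}

final class FlushSketch {
    // Stands in for the real flush: just remembers what would have been flushed.
    final List<Dwpt> flushed = new ArrayList<>();

    // Caller must hold state.lock. Swaps in a fresh DWPT and returns the old one,
    // or returns null if this state is not marked as pending.
    Dwpt checkOut(ThreadState state) {
        if (!state.flushPending) {
            return null;
        }
        Dwpt old = state.dwpt;
        state.dwpt = new Dwpt();
        state.flushPending = false;
        return old;
    }

    // Called by an indexing thread that has just added a document and still
    // holds own.lock.
    void afterDocument(ThreadState own, List<ThreadState> all) {
        Dwpt mine;
        try {
            mine = checkOut(own);
        } finally {
            own.lock.unlock(); // other threads can index into the fresh DWPT now
        }
        if (mine != null) {
            flush(mine); // flushing happens outside the lock
        }
        for (ThreadState other : all) {
            if (other == own) {
                continue;
            }
            // One non-blocking attempt: if the lock is taken, its holder will
            // see flushPending itself once it has added its document.
            if (other.lock.tryLock()) {
                Dwpt pending;
                try {
                    pending = checkOut(other);
                } finally {
                    other.lock.unlock();
                }
                if (pending != null) {
                    flush(pending);
                }
            }
        }
    }

    void flush(Dwpt dwpt) {
        synchronized (flushed) {
            flushed.add(dwpt);
        }
    }
}
```

The point of the sketch is that a ThreadState lock is only held for the swap, never for the flush itself, so indexing into the new DWPT can continue while the old one is written.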
This approach tries to utilize as much concurrency as possible while flushing
the DWPT, by releasing its ThreadState with an entirely new DWPT. Yet, this
might also have problems, especially if IO is slow and we are filling up
indexing RAM too fast. To prevent us from bloating up the memory too much I
introduced a notion of "healthiness" which operates on the net-bytes used in
the DocumentsWriterSession (flushBytes + pendingBytes + activeBytes), where
flushBytes is the memory consumption of the currently flushing DWPTs,
pendingBytes is the memory consumption of the ThreadStates / DWPTs marked as
pending, and activeBytes is the memory consumption of the indexing DWPTs. If
net-bytes reach a certain threshold (2*maxRam currently) I stop incoming
threads until the session becomes healthy again.
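The healthiness check can be sketched roughly as follows. This is only an illustration of the accounting described above, not the patch itself: the class SessionHealth and its method names are made up, and blocking is done with plain wait/notifyAll for brevity.

```java
// Hypothetical accounting class; only the three byte counters and the
// 2*maxRam threshold come from the description above.
final class SessionHealth {
    private final long maxRamBytes;
    private long activeBytes;  // RAM held by DWPTs that are still indexing
    private long pendingBytes; // RAM held by DWPTs marked as flushPending
    private long flushBytes;   // RAM held by DWPTs that are currently flushing

    SessionHealth(long maxRamBytes) {
        this.maxRamBytes = maxRamBytes;
    }

    synchronized long netBytes() {
        return flushBytes + pendingBytes + activeBytes;
    }

    // Unhealthy as soon as net-bytes reach twice the configured RAM buffer.
    synchronized boolean isHealthy() {
        return netBytes() < 2 * maxRamBytes;
    }

    synchronized void addActive(long bytes) {
        activeBytes += bytes;
    }

    // Moving a DWPT between levels does not change net-bytes.
    synchronized void markPending(long bytes) {
        activeBytes -= bytes;
        pendingBytes += bytes;
    }

    synchronized void startFlush(long bytes) {
        pendingBytes -= bytes;
        flushBytes += bytes;
    }

    // Only a finished flush actually frees memory, so wake up stalled threads here.
    synchronized void finishFlush(long bytes) {
        flushBytes -= bytes;
        notifyAll();
    }

    // Incoming indexing threads call this before adding a document.
    synchronized void waitUntilHealthy() throws InterruptedException {
        while (!isHealthy()) {
            wait();
        }
    }
}
```

Note that marking a DWPT as pending or starting its flush leaves net-bytes unchanged, so a stalled session only recovers once a flush completes.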
I ran luceneutil with trunk vs. LUCENE-2573, indexing 300k Wikipedia docs with
1GB MaxRamBuffer and 4 threads. Searches on both indexes yield identical
results (Phew!)
The indexing times look promising:
||trunk||patch|| diff ||
|134129 ms|102932 ms|{color:green}23.25%{color}|
This patch is still kind of rough and needs iterations so reviews and questions
are very much welcome.
> Tiered flushing of DWPTs by RAM with low/high water marks
> ---------------------------------------------------------
>
> Key: LUCENE-2573
> URL: https://issues.apache.org/jira/browse/LUCENE-2573
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Simon Willnauer
> Priority: Minor
> Fix For: Realtime Branch
>
> Attachments: LUCENE-2573.patch, LUCENE-2573.patch, LUCENE-2573.patch,
> LUCENE-2573.patch
>
>
> Now that we have DocumentsWriterPerThreads we need to track total consumed
> RAM across all DWPTs.
> A flushing strategy idea that was discussed in LUCENE-2324 was to use a
> tiered approach:
> - Flush the first DWPT at a low water mark (e.g. at 90% of allowed RAM)
> - Flush all DWPTs at a high water mark (e.g. at 110%)
> - Use linear steps in between high and low watermark: E.g. when 5 DWPTs are
> used, flush at 90%, 95%, 100%, 105% and 110%.
> Should we allow the user to configure the low and high water mark values
> explicitly using total values (e.g. low water mark at 120MB, high water mark
> at 140MB)? Or shall we keep for simplicity the single setRAMBufferSizeMB()
> config method and use something like 90% and 110% for the water marks?
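For reference, the linear steps from the quoted issue description can be computed as in the sketch below. The class and method names are hypothetical, the water marks are passed in as absolute byte values, and using only the low water mark for a single DWPT is an assumption made here, not something the issue specifies.

```java
// Hypothetical helper: evenly spaced flush thresholds between the low and
// high water mark, one per DWPT.
final class TieredWaterMarks {
    static long[] thresholds(long lowBytes, long highBytes, int numDwpts) {
        if (numDwpts < 1 || highBytes < lowBytes) {
            throw new IllegalArgumentException("need numDwpts >= 1 and high >= low");
        }
        long[] tiers = new long[numDwpts];
        if (numDwpts == 1) {
            tiers[0] = lowBytes; // assumption: a single DWPT flushes at the low mark
            return tiers;
        }
        for (int i = 0; i < numDwpts; i++) {
            // linear interpolation: tier 0 is the low mark, the last tier the high mark
            tiers[i] = lowBytes + (highBytes - lowBytes) * i / (numDwpts - 1);
        }
        return tiers;
    }
}
```

With 5 DWPTs and water marks at 90 and 110 (i.e. 90% and 110% of a budget of 100), this yields 90, 95, 100, 105 and 110, matching the example in the issue.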