[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841915#action_12841915 ]
Michael Busch commented on LUCENE-2293:
---------------------------------------

{quote}
This is a great approach for speeding up NRT - NRT readers will no
longer have to flush. It's similar in spirit to LUCENE-1313, but that
issue is still flushing segments (but, into an intermediate RAMDir).
{quote}

I agree!

Thinking further about this: each (re)opened RAM segment reader also
needs to remember the maxDoc of the corresponding DW at the time it
was (re)opened. This way we can prevent a RAM reader from reading
posting lists beyond that maxDoc, even if the writer thread keeps
building the lists in parallel. This allows us to guarantee the
point-in-time requirements.

Also, the PostingList objects we store in the TermsHash already
contain a lastDocID (if I remember correctly). So when a RAM reader's
TermEnum iterates the dictionary it can skip all terms where
term.lastDocID > RAMReader.maxDoc. It's quite neat that all we then
have to do in reopen is update ramReader.maxDoc and ramReader.seqID.

Of course one big thing is still missing: keeping the term dictionary
sorted. In order to implement the full IndexReader interface,
specifically TermEnum, it's necessary to give each RAM reader a
point-in-time sorted dictionary - at least in one direction, since a
TermEnum only seeks forward. I think we have two options here:

Either we try to keep the dictionary always sorted, whenever a term
is added. I guess then we'd have to implement a b-tree or something
similar?

The second option I can think of is to add a "nextTerm" pointer to
TermsHash.PostingList. This allows us to build a linked list across
all terms. When a RAM reader is opened we would sort all terms - not
by changing their position in the hash, but by building the
single-linked list in sorted order. When a new reader gets (re)opened
we need to mergesort the new terms into the linked list.

I guess it's easy to implement this lock-free. E.g. if you have the
linked list a->c and you want to add b in the middle, you set b->c
before changing a->c. Then it's undefined whether an in-flight older
reader would see term b. The old reader must not return b, since b
was added after the old reader was (re)opened. So either case is
fine: either it doesn't see b because the link wasn't updated yet, or
it sees b but doesn't return it, because b.lastDocID >
ramReader.maxDoc. Sketches of both pieces follow below.

The downside is that we will have to pay the price of sorting in
reader.reopen, which however should be cheap if readers are reopened
frequently. I'm not sure, though, whether this linked-list approach
is more or less compelling than something like a b-tree.

Btw: shall we open a new "searchable DW buffer" issue, or continue
using this issue for these discussions?
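To make the point-in-time skip concrete, here is a minimal sketch. The
class and field names (RAMPostingList, RAMTermEnum, lastDocID,
nextTerm) are hypothetical stand-ins, not the actual TermsHash
structures: the reader snapshots the DW's maxDoc at (re)open time, and
its term enum hides any term whose lastDocID lies beyond that
snapshot.

{code}
// Hypothetical stand-in for a TermsHash posting-list entry.
class RAMPostingList {
  final String term;
  volatile int lastDocID;            // advanced by the writer thread
  volatile RAMPostingList nextTerm;  // sorted-order link (see below)

  RAMPostingList(String term, int lastDocID) {
    this.term = term;
    this.lastDocID = lastDocID;
  }
}

// Point-in-time term iteration: maxDoc is snapshotted when the reader
// is (re)opened; terms with term.lastDocID > maxDoc are skipped.
class RAMTermEnum {
  private final int maxDoc;        // DW's maxDoc at (re)open time
  private RAMPostingList current;  // position in the sorted list

  RAMTermEnum(RAMPostingList sentinelHead, int maxDoc) {
    this.current = sentinelHead;
    this.maxDoc = maxDoc;
  }

  /** Advances to the next term visible to this point-in-time reader. */
  boolean next() {
    while (current != null) {
      current = current.nextTerm;
      if (current != null && current.lastDocID <= maxDoc) {
        return true;  // term existed when this reader was (re)opened
      }
      // else: skip - this term was added after our maxDoc snapshot
    }
    return false;
  }

  String term() {
    return current == null ? null : current.term;
  }
}
{code}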
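And a sketch of the lock-free insert described above, reusing the
hypothetical RAMPostingList from the previous sketch and assuming a
single inserting (writer) thread while readers only traverse: the new
node's outgoing link is written before the incoming link, so an
in-flight older reader either misses b entirely or sees it and
filters it out via the lastDocID/maxDoc check.

{code}
final class SortedTermList {
  /**
   * Inserts b after a: publish b->c before a->b, so a reader never
   * observes a broken list. If a reader races past a before the
   * second write, it simply misses b - which is fine, because b was
   * added after that reader's maxDoc snapshot anyway.
   */
  static void insertAfter(RAMPostingList a, RAMPostingList b) {
    b.nextTerm = a.nextTerm;  // step 1: b -> c (b not yet reachable)
    a.nextTerm = b;           // step 2: a -> b (volatile publish)
  }
}
{code}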
> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>
>                 Key: LUCENE-2293
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2293
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5. This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread
> to finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads. So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput. So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter. Today, a single docID stream is assigned
> across threads (eg one thread gets docID=0, the next one docID=1,
> etc.), and each thread writes to a private RAM buffer (living in the
> thread state), and then on flush we do a merge sort. The merge sort
> is inefficient (it does not currently use a PQ)... and wasteful,
> because we must re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and
> allocate private docIDs to each thread). We can then leave merging
> to CMS, which can already run merges in the BG without blocking
> ongoing indexing (unlike the merge we do in flush today).
> This would also allow us to flush thread states separately. Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running. This should be a
> good concurrency gain, since it uses IO & CPU resources "throughout"
> indexing instead of the "big burst of CPU only" then "big burst of
> IO only" that we have today (flush today "stops the world").
> One downside I can think of is that docIDs would now be "less
> monotonic": today, if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs, but with this change all of one
> thread state would get 0..N docIDs, the next thread state would get
> N+1..M docIDs, etc. However, a single thread would still get
> monotonic assignment of docIDs.
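For illustration, a minimal sketch of the "make it settable (expert)"
fix proposed in the description above - the class and method names
here (maxThreadStates, setMaxThreadStates) are illustrative, not the
committed API:

{code}
// Illustrative only: replaces the hardwired MAX_THREAD_STATE constant
// with an expert setting and a higher default.
public class DocumentsWriterSettings {
  private int maxThreadStates = 8;  // proposed new default (was 5)

  /**
   * Expert: how many threads may index concurrently inside
   * IndexWriter. Higher values raise CPU concurrency but lower RAM
   * efficiency per thread state, so flushes happen more often.
   */
  public void setMaxThreadStates(int maxThreadStates) {
    this.maxThreadStates = maxThreadStates;
  }

  public int getMaxThreadStates() {
    return maxThreadStates;
  }
}
{code}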