[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841106#action_12841106 ]

Shai Erera commented on LUCENE-2293:
------------------------------------

bq. The IndexWriter (or a new class) would have the doc queue, basically a load 
balancer, that multiple DocumentsWriter instances would pull from as soon as 
they are done inverting the previous document?

Today, DW enforces thread binding - the same thread will always receive the 
same ThreadState. This allows applications that distribute documents between 
threads based on some criteria to get locality of the documents indexed by 
each thread. I can't think of why an application would rely on that, but 
still, it's something that happens today.
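
Just to make the point concrete, here is a minimal sketch of what that binding 
amounts to - the names and structure are illustrative only, not the actual DW 
internals:

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only - not the real DocumentsWriter code.
// The same application thread is always handed back the same ThreadState,
// which is what gives per-thread locality of the indexed documents.
class ThreadBindingSketch {
  private static final int MAX_THREAD_STATES = 5;
  private final ThreadState[] states = new ThreadState[MAX_THREAD_STATES];
  private final Map<Thread, ThreadState> bindings = new HashMap<Thread, ThreadState>();

  ThreadBindingSketch() {
    for (int i = 0; i < states.length; i++) {
      states[i] = new ThreadState();
    }
  }

  synchronized ThreadState threadStateFor(Thread t) {
    ThreadState state = bindings.get(t);
    if (state == null) {
      state = leastLoaded();   // first call: bind this thread to a state for good
      bindings.put(t, state);
    }
    return state;
  }

  private ThreadState leastLoaded() {
    ThreadState best = states[0];
    for (ThreadState s : states) {
      if (s.numDocs < best.numDocs) {
        best = s;
      }
    }
    return best;
  }

  static class ThreadState {
    int numDocs;   // plus the per-thread RAM buffer, pending postings, etc.
  }
}
{code}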

Also, in the pull approach, Lucene would introduce another place where it 
allocates threads. Not only would we need to allow setting that concurrency 
level, we'd also need to allow overriding how a thread is instantiated. That 
would change the way applications are written today - I assume lots of 
multi-threaded applications rely on their own threads to index the documents. 
But now those threads won't do anything besides register a document in a 
queue. Therefore such applications will need to move to single-threaded 
indexing (because multi-threading gives them nothing) and instead control the 
threads IW allocates.
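
For contrast, here is roughly what that pull model would look like - purely a 
sketch with made-up names; the point is only that IW itself would now own 
worker threads:

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the "pull" model under discussion - not real Lucene code.
// Application threads only enqueue; IW-owned workers pull and invert.
class PullModelSketch {
  private final BlockingQueue<Object> docQueue = new LinkedBlockingQueue<Object>();

  // Called from application threads: all they do is register the document.
  void addDocument(Object doc) throws InterruptedException {
    docQueue.put(doc);
  }

  // IW would have to spawn these itself; configuring this concurrency level
  // and how the threads are created is exactly the new knob objected to above.
  void startWorkers(int concurrencyLevel) {
    for (int i = 0; i < concurrencyLevel; i++) {
      Thread worker = new Thread(new Runnable() {
        public void run() {
          try {
            while (true) {
              invert(docQueue.take());   // index into this worker's private buffer
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
      worker.setDaemon(true);
      worker.start();
    }
  }

  private void invert(Object doc) { /* ... */ }
}
{code}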

I personally prefer to leave multi-threaded indexing to the application. If it 
already maintains a queue of incoming documents (from the outside) and 
allocates threads to process them in parallel (for example to parse rich-text 
documents, fetch content from remote machines, etc.), we wouldn't want it to 
do all that just to waste those threads at the end and let IW control another 
level of concurrency.
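
As a sketch of that common pattern - IndexWriter.addDocument is the real API, 
while the fetch/parse helpers and the pool size are made up for illustration:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch of the common pattern: the application owns the worker threads that
// fetch/parse content and then call IndexWriter directly. Under a pull model
// these threads would only enqueue, wasting the app's own concurrency.
class AppSideIndexing {
  private final IndexWriter writer;                                    // configured elsewhere
  private final ExecutorService pool = Executors.newFixedThreadPool(8);

  AppSideIndexing(IndexWriter writer) {
    this.writer = writer;
  }

  void submit(final String url) {
    pool.execute(new Runnable() {
      public void run() {
        try {
          String raw = fetch(url);        // e.g. fetch content from a remote machine
          Document doc = parse(raw);      // e.g. parse a rich-text document
          writer.addDocument(doc);        // indexing happens on the app's own thread
        } catch (Exception e) {
          // handle/log the per-document failure
        }
      }
    });
  }

  private String fetch(String url) { /* ... */ return ""; }

  private Document parse(String raw) { /* ... */ return new Document(); }
}
{code}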

Another downside of such an approach is that it breaks backward compatibility 
in a new way we've never considered. If the application allocates threads from 
a pool, and we introduce a new IW/DW with concurrency level=3 (for example), 
then the application will suddenly spawn more threads than it intended to. 
Perhaps it chose to use SMS, or overrode CMS to handle thread allocation, but 
it's definitely not ready to handle another thread allocator.

Another thing is that the queue could not hold just Document objects, but 
would need DocAndOp objects to account for adds/deletes/updates ... another 
complication.
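
Something along these lines - DocAndOp is the hypothetical class named above, 
not an existing Lucene type:

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;

// Hypothetical queue entry - every entry must say which operation it carries,
// not just hold a Document.
class DocAndOp {
  enum Op { ADD, UPDATE, DELETE }

  final Op op;
  final Document doc;      // null for DELETE
  final Term deleteTerm;   // null for ADD; identifies the doc(s) for UPDATE/DELETE

  private DocAndOp(Op op, Document doc, Term deleteTerm) {
    this.op = op;
    this.doc = doc;
    this.deleteTerm = deleteTerm;
  }

  static DocAndOp add(Document doc)               { return new DocAndOp(Op.ADD, doc, null); }
  static DocAndOp update(Term term, Document doc) { return new DocAndOp(Op.UPDATE, doc, term); }
  static DocAndOp delete(Term term)               { return new DocAndOp(Op.DELETE, null, term); }
}
{code}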

My preference is to leave the queue to the application.

bq. The other downside is that you would have to buffer deleted docs and 
queries separately for each thread state

Just for clarity - you'd need to do that with the queue approach as well, 
right? I mean, a DW that pulled a DELETE operation from the queue would need 
to cache that DELETE so that it is applied to all documents indexed up until 
the flush. So that doesn't save anything vs. changing DW to flush by 
ThreadState.
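
In both designs the DELETE has to be buffered against everything indexed so 
far, roughly like this (illustrative names; loosely modeled on the idea of 
recording a "docID upto" per buffered delete):

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.Term;

// Illustrative sketch: whether a DELETE comes from a shared queue or through a
// ThreadState, it must be buffered and later applied to every document that
// was indexed before the flush - the queue approach saves nothing here.
class BufferedDeletesSketch {
  // For each deleted term, the number of docs buffered before the delete arrived.
  private final Map<Term, Integer> deleteTermToDocIDUpto = new HashMap<Term, Integer>();
  private int numBufferedDocs;

  synchronized void onDocumentIndexed() {
    numBufferedDocs++;
  }

  synchronized void onDelete(Term term) {
    // The delete covers all documents buffered so far in this (thread) state.
    deleteTermToDocIDUpto.put(term, numBufferedDocs);
  }

  synchronized void applyOnFlush() {
    // At flush time each buffered delete is applied to the docs indexed before
    // it was recorded, then the buffers are cleared for the next round.
    deleteTermToDocIDUpto.clear();
    numBufferedDocs = 0;
  }
}
{code}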

Instead, I prefer to take advantage of the application's concurrency level in 
the following way (a rough sketch follows below):
* Each thread will continue to write documents to a ThreadState. We'll allow 
changing MAX_LEVEL, so if an app wants more concurrency, it can get it.
** MAX_LEVEL will set the number of ThreadState objects available.
* All threads will obtain memory buffers from a pool, which will be limited by 
IW's RAM limit.
* When a thread finishes indexing a document and realizes the pool has been 
exhausted, it flushes its ThreadState.
** At that moment, that ThreadState is pulled out of the 'active' list and is 
flushed. When it's done, it reclaims its used buffers and is put back in the 
active list.
** New threads that come in will simply pick a ThreadState from the pool (but 
we'll bind them to that instance until it's flushed) and add documents to it.
** That way, we hijack an application thread to do the flushing, which is 
anyway what happens today.

That way we are less likely to reach a state like Mike described - "big burst 
of CPU only" then "big burst of IO only" - and more likely to balance the two.

If the application wants to be single-threaded, we allow it to be that way all 
the way through, without introducing more thread allocations. Otherwise, we 
let it control its concurrency level and put that concurrency to use for our 
needs.

> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>
>                 Key: LUCENE-2293
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2293
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
>
>
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> to which I probably should have attached a //nocommit the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic": today, if N threads are indexing, you'll roughly get
> in-time-order assignment of docIDs across threads.  But with this
> change, all of one thread state would get 0..N docIDs, the next
> thread state would get N+1..M docIDs, etc.  However, a single thread
> would still get monotonic assignment of docIDs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

