2008/10/23 Michael McCandless <[EMAIL PROTECTED]>:
>
> Mark Miller wrote:
>
>> Glen Newton wrote:
>>>
>>> 2008/10/23 Mark Miller <[EMAIL PROTECTED]>:
>>>
>>>> It sounds like you might have some thread synchronization issues outside
>>>> of
>>>> Lucene. To simplify things a bit, you might try just using one
>>>> IndexWriter.
>>>> If I remember right, the IndexWriter is now pretty efficient, and there
>>>> isn't much need to index to smaller indexes and then merge. There is a
>>>> lot
>>>> of juggling to get wrong with that approach.
>>>>
>>>
>>> While I agree it is easier to have a single IndexWriter, if you have
>>> multiple cores you will get significant speed-ups with multiple
>>> IndexWriters, even with the impact of merging at the end.
>>> #IndexWriters = # physical cores is a reasonable rule of thumb.
>>>
>>> General speed-up estimate: (0.6 to 0.8) * # cores over a single IndexWriter
>>> YMMV
>>>
>>> When I get around to it, I'll re-run my tests varying the # of
>>> IndexWriters & post.
>>>
>>> -Glen
>>>
>> Hey Mr McCandless, what's up with that? Can IndexWriter be made as
>> efficient as using multiple writers? Where do you suppose the holdup is?
>> Number of threads doing merges? Sync contention? I hate the idea of multiple
>> IndexWriter/Readers being more efficient than a single instance. In an ideal
>> Lucene world, a single instance would hide the complexity and use the number
>> of threads needed to match multiple instance performance.
>
> Honestly this surprises me: I would expect a single IndexWriter with
> multiple threads to be as fast as (or faster than, considering the extra
> merge time at the end) multiple IndexWriters.
>
> IndexWriter's concurrency has improved a lot lately, with
> ConcurrentMergeScheduler.  The only serious operation that is not concurrent
> is flushing the RAM buffer as a new segment; but in a well tuned indexing
> process (large RAM buffer) the time spent there should be quite small,
> especially with a fast IO system.
>
> Actually, addIndexes is also not concurrent in that if multiple threads call
> it, only one can run at once.  But normally you would call it with all the
> indices you want to add, and then the merging is concurrent.
>
> Glen, in your single IndexWriter test, is it possible there was accidental
> thread contention during document preparation or analysis?

I don't think there is. I've been refining this for quite a while, and
have done a lot of analysis and hand-checking of the threading stuff.

I do use multiple threads for document creation: this is where much of
the speed-up happens (at least in my case where I have a large indexed
field for the full-text of an article: the parsing becomes a
significant part of the process).
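
A minimal sketch of that arrangement, not my actual code: a fixed pool of
worker threads does the expensive per-document preparation, and every worker
hands its result to one shared, thread-safe sink. DocumentSink and prepare()
are hypothetical stand-ins (for a single shared IndexWriter and for the
parsing step, respectively), and it uses only the JDK so it runs without
Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch: prepare documents on many threads, feed one shared sink. */
final class ParallelDocPrep {

    /** Hypothetical stand-in for a single shared writer; must be thread-safe. */
    interface DocumentSink {
        void add(String preparedDoc);
    }

    /** Hypothetical stand-in for the costly parsing/analysis of one document. */
    static String prepare(String rawText) {
        return rawText.trim().toLowerCase(Locale.ROOT);
    }

    /** Prepares every input on a worker pool and passes each result to the sink. */
    static void indexAll(List<String> rawTexts, DocumentSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String raw : rawTexts) {
            // Preparation and the hand-off both run on the worker thread.
            pool.execute(() -> sink.add(prepare(raw)));
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("document preparation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        // A synchronized list plays the role of the thread-safe sink here.
        List<String> indexed = Collections.synchronizedList(new ArrayList<>());
        int threads = Runtime.getRuntime().availableProcessors();
        indexAll(Arrays.asList("  First ARTICLE ", "Second article"), indexed::add, threads);
        System.out.println(indexed.size() + " documents handed to the shared sink");
    }
}
```

With Lucene, the sink would be one IndexWriter shared by all workers, since
IndexWriter is safe to call from multiple threads. The multiple-writer variant
would give each worker its own sink and merge at the end.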

> I do agree that we should strive to have enough concurrency in IndexWriter
> and IndexReader so that you don't get any real benefit by using separate
> instances. E.g., in 2.4.0 you can now open read-only IndexReaders, and on Unix
> you can use NIOFSDirectory, both of which should go a long way towards
> fixing IndexReader's concurrency issue.

My original tests were in the Spring with 2.3.1. I am planning on
doing the new tests with 2.4 for indexing, as well as re-doing my
concurrent query tests[1] and concurrent multiple reader tests[2]
using the features you describe. I am sure the results will be quite
different...

BTW the files I am indexing were originally PDFs, but were batch
converted to text and stored compressed on the filesystem, so except
for GUnzipping them there is no other overhead.
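
For reference, the decompression step can be sketched with the JDK's
GZIPInputStream as below. This is an illustration rather than my actual
loader: the class and method names are hypothetical, and UTF-8 is an assumed
encoding for the converted text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch: read gzip-compressed text into a String (UTF-8 is an assumption). */
final class GzipText {

    /** Decompresses the stream and returns its full text content. */
    static String readGzippedText(InputStream compressed) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the demo only: compresses a String the same way. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // complete once the gzip stream is closed
    }

    public static void main(String[] args) {
        byte[] stored = gzip("full text of an article");
        System.out.println(readGzippedText(new ByteArrayInputStream(stored)));
    }
}
```

For files on disk, a FileInputStream would be passed in place of the
in-memory stream used in the demo.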

[1]http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
[2]http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html

-glen

> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


