Glen Newton wrote:
2008/10/23 Michael McCandless <[EMAIL PROTECTED]>:
Mark Miller wrote:
Glen Newton wrote:
2008/10/23 Mark Miller <[EMAIL PROTECTED]>:
It sounds like you might have some thread synchronization issues
outside of Lucene. To simplify things a bit, you might try just using
one IndexWriter. If I remember right, the IndexWriter is now pretty
efficient, and there isn't much need to index to smaller indexes and
then merge. There is a lot of juggling to get wrong with that approach.
While I agree it is easier to have a single IndexWriter, if you have
multiple cores you will get significant speed-ups with multiple
IndexWriters, even with the impact of merging at the end.
#IndexWriters = # physical cores is a reasonable rule of thumb.
General speed-up estimate: # cores * (0.6 to 0.8) over a single
IndexWriter
YMMV
When I get around to it, I'll re-run my tests varying the # of
IndexWriters & post.
-Glen
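As a worked example of the rule of thumb above, the following minimal sketch turns a core count into the estimated speed-up range. It assumes the estimate is meant as a factor of 0.6 to 0.8 per physical core; the class and method names are hypothetical, and the numbers are only as good as the rule of thumb itself.

```java
// Hypothetical helper illustrating the rule of thumb quoted above.
// Assumption: "# cores * 0.6 - 0.8" means a factor between 0.6 and 0.8
// per physical core, relative to a single IndexWriter.
final class SpeedupEstimate {

    /** Returns {low, high} estimated speed-up for the given core count. */
    static double[] estimate(int physicalCores) {
        return new double[] { physicalCores * 0.6, physicalCores * 0.8 };
    }

    public static void main(String[] args) {
        // availableProcessors() reports logical processors, which may be
        // higher than the physical core count on hyper-threaded machines.
        int cores = Runtime.getRuntime().availableProcessors();
        double[] range = estimate(cores);
        System.out.printf("%d cores: roughly %.1fx to %.1fx%n",
                cores, range[0], range[1]);
    }
}
```

For 8 physical cores this reading of the estimate gives a range of roughly 4.8x to 6.4x.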
Hey Mr McCandless, what's up with that? Can IndexWriter be made to be
as efficient as using multiple writers? Where do you suppose the hold
up is? Number of threads doing merges? Sync contention? I hate the idea
of multiple IndexWriter/Readers being more efficient than a single
instance. In an ideal Lucene world, a single instance would hide the
complexity and use the number of threads needed to match multiple
instance performance.
Honestly this surprises me: I would expect a single IndexWriter with
multiple threads to be as fast as (or faster than, considering the
extra merge time at the end) multiple IndexWriters.
IndexWriter's concurrency has improved a lot lately, with
ConcurrentMergeScheduler. The only serious operation that is not
concurrent is flushing the RAM buffer as a new segment; but in a
well-tuned indexing process (large RAM buffer) the time spent there
should be quite small, especially with a fast IO system.
Actually, addIndexes is also not concurrent in that if multiple threads
call it, only one can run at once. But normally you would call it with
all the indices you want to add, and then the merging is concurrent.
Glen, in your single IndexWriter test, is it possible there was
accidental thread contention during document preparation or analysis?
I don't think there is. I've been refining this for quite a while, and
have done a lot of analysis and hand-checking of the threading stuff.
OK.
For your multiple-index-writer test, how much time is spent building
the N indices vs merging them in the end?
I do use multiple threads for document creation: this is where much of
the speed-up happens (at least in my case where I have a large indexed
field for the full-text of an article: the parsing becomes a
significant part of the process).
So in the single-index-writer vs multiple-index-writer tests, this
part (64 threads that construct document objects) is unchanged, right?
How do you rate limit the 64 threads? (Ie, slow them down when they
get too far ahead of indexing).
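One common way to rate limit producer threads like these is a bounded queue between document construction and indexing: producers block as soon as the queue is full, so they cannot run arbitrarily far ahead. The sketch below shows that general technique only, not how the tests in this thread were actually built. The class name, method name, and parameters are hypothetical, and plain strings stand in for Lucene documents.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a bounded queue applies backpressure to the
// threads that build documents. Strings stand in for Lucene documents.
final class BoundedPipeline {

    /** Runs the pipeline and returns how many documents were consumed. */
    static int run(int producers, int docsPerProducer, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        ExecutorService pool = Executors.newFixedThreadPool(producers);

        for (int p = 0; p < producers; p++) {
            final int id = p;
            pool.execute(() -> {
                try {
                    for (int i = 0; i < docsPerProducer; i++) {
                        // put() blocks while the queue is full, which is
                        // what slows the producers down.
                        queue.put("doc-" + id + "-" + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        int total = producers * docsPerProducer;
        int consumed = 0;
        try {
            while (consumed < total) {
                queue.take(); // stand-in for handing the document to the indexer
                consumed++;
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while consuming", e);
        }
        return consumed;
    }
}
```

With this shape, the queue capacity bounds how many finished documents can be waiting in memory at any time, regardless of how many producer threads there are.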
If you only process documents with the 64 threads (but not index
them), what percentage of the total time is that? I'd like to tease
out "building documents" vs "indexing" times.
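A measurement like the one asked for above could be taken by timing a run that only builds documents, timing a full run, and expressing the first as a percentage of the second. This is a minimal sketch under that assumption; the class and method names are hypothetical, and the workloads in main are placeholders rather than real document construction or indexing.

```java
// Hypothetical timing helper for separating "building documents" from
// "indexing". The workloads in main() are placeholders only.
final class PhaseTiming {

    /** Runs the work once and returns elapsed wall-clock nanoseconds. */
    static long timeNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    /** Share of the total run taken by one phase, as a percentage. */
    static double percentOfTotal(long phaseNanos, long totalNanos) {
        return 100.0 * phaseNanos / totalNanos;
    }

    // Placeholder CPU work so that main() has something to measure.
    private static long busyWork(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // In a real test these would be: (1) the document construction
        // threads alone, and (2) construction plus indexing.
        long buildOnly = timeNanos(() -> busyWork(2_000_000));
        long buildAndIndex = timeNanos(() -> busyWork(5_000_000));
        System.out.printf("building documents: %.1f%% of total%n",
                percentOfTotal(buildOnly, buildAndIndex));
    }
}
```

Single runs of this kind are noisy, so repeating each measurement a few times and discarding the first (warm-up) run is usually worthwhile.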
I do agree that we should strive to have enough concurrency in
IndexWriter and IndexReader so that you don't get any real benefit by
using separate instances. Eg in 2.4.0 you can now open read-only
IndexReaders, and on Unix you can use NIOFSDirectory, both of which
should go a long way towards fixing IndexReader's concurrency issue.
My original tests were in the Spring with 2.3.1. I am planning on
doing the new tests with 2.4 for indexing, as well as re-doing my
concurrent query tests[1] and concurrent multiple reader tests[2]
using the features you describe. I am sure the results will be quite
different...
Also, for the indexing tests, make sure you run with autoCommit=false.
BTW the files I am indexing were originally PDFs, but were batch
converted to text and stored compressed on the filesystem, so except
for GUnzipping them there is no other overhead.
But I'm confused: why do you need 64 threads to build up the
documents? Gunzipping should be very low CPU cost. Are you
pre-analyzing the fields on your documents?
Mike