Chuck Williams wrote:
Michael McCandless wrote on 01/15/2007 01:49 AM:
Chuck,
Possibly related, one of the ways I improved concurrency in
ParallelWriter was to break up IndexWriter.addDocument() into one method
to invert the document and create a RAMSegment and a second method that
takes the RAMSegment and merges it into the index. This allows
inversions to be processed in parallel, while merging is already a
critical section. (Side thought: I've been wondering how hard it would
be to make merging not a critical section). I had thought of the method
to take the RAMSegment and merge it to be the "commit" part of
addDocument().
Your notion of commit is much better and more flexible, but perhaps you
could include this inversion/merge separation as well?
I'm a little confused about what this would mean. Do you mean opening up
separate public methods: one to invert (and get a segment back) and
one to append (and possibly merge) a segment to the index (keeping the
existing addDocument that would then just call these two)? How would
this buy you more concurrency (since the current method indeed only
synchronizes around the merge part)? Oh: would you behind the scenes
take each "single doc" segment and pre-merge them privately and
concurrently, possibly up to many levels, and then finally
add the merged segment into the index? Ie, the beginnings of
"concurrent merge" described above?
Actually couldn't we do this change today (ie without waiting for
explicit commits)? It seems like a separable change.
Yes, I've already made this change so it is independent, creating
invertDocument(), addInvertedDocument() and abortInvertedDocument().
This enables more concurrency in ParallelWriter because there are no
synchronization restrictions at all on calling invertDocument().
addInvertedDocument() has a synchronization requirement: it can be
called in parallel for each subdocument corresponding to the same
document, but not for subdocuments corresponding to different documents
as this could break the required parallel subindex doc-id
correspondence. Because addDocument() (which is just
addInvertedDocument(invertDocument())) contains the call to
addInvertedDocument() it has the same synchronization requirement,
preventing the extra parallelism in the invertDocument() calls.
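To make the split concrete, here is a minimal sketch of the pattern, not the actual ParallelWriter or IndexWriter code. The SubIndex and InvertedDoc classes and the indexAll() helper are hypothetical stand-ins invented for illustration; only the names invertDocument() and addInvertedDocument() come from the description above. Inversion runs on a thread pool with no locking, while the adds are issued one document at a time so that every subindex assigns the same doc id to the same document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-ins for illustration; not Lucene's real classes.
public class InvertThenAddSketch {

    // Plays the role of the single-document RAMSegment produced by inversion.
    static final class InvertedDoc {
        final String[] terms;
        InvertedDoc(String[] terms) { this.terms = terms; }
    }

    // Plays the role of one subindex; a doc id is the position in the list.
    static final class SubIndex {
        private final List<InvertedDoc> docs = new ArrayList<>();

        // Touches no shared state, so it needs no synchronization.
        InvertedDoc invertDocument(String text) {
            return new InvertedDoc(text.toLowerCase().split("\\s+"));
        }

        // Mutates the index; callers must not interleave different documents.
        int addInvertedDocument(InvertedDoc doc) {
            docs.add(doc);
            return docs.size() - 1;
        }

        InvertedDoc get(int docId) { return docs.get(docId); }
    }

    // documents[d][s] is the sub-document of document d destined for subindex s.
    // Returns the doc id each subindex assigned to each document.
    static int[][] indexAll(SubIndex[] subIndexes, String[][] documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Phase 1: invert every sub-document concurrently.
            List<List<Future<InvertedDoc>>> pending = new ArrayList<>();
            for (String[] subDocs : documents) {
                List<Future<InvertedDoc>> row = new ArrayList<>();
                for (int s = 0; s < subIndexes.length; s++) {
                    final SubIndex idx = subIndexes[s];
                    final String text = subDocs[s];
                    Callable<InvertedDoc> task = () -> idx.invertDocument(text);
                    row.add(pool.submit(task));
                }
                pending.add(row);
            }

            // Phase 2: add in document order. The adds for one document could
            // themselves run in parallel across subindexes; they are done
            // sequentially here for brevity.
            int[][] ids = new int[documents.length][subIndexes.length];
            for (int d = 0; d < documents.length; d++) {
                for (int s = 0; s < subIndexes.length; s++) {
                    InvertedDoc inverted = pending.get(d).get(s).get();
                    ids[d][s] = subIndexes[s].addInvertedDocument(inverted);
                }
            }
            return ids;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("inversion failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling addDocument() directly would fold phase 1 into phase 2, which is exactly the loss of parallelism described above.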
It seemed to me that this could be related to your explicit-commits
idea since it also breaks up writes into an uncommitted local portion
and committed portion.
Ahh I think I see: you needed to tease out that fine detail on what
synchronization is actually required (the fact that sub-documents can
be done entirely in parallel, but cross-documents cannot). And the
sub-documents indeed give you excellent concurrency (if you make lots
of sub-documents) on boxes that have the CPU resources to allocate.
This is a neat change, but I think separate from explicit commits
so I think we should keep them decoupled at this point.
Mike