Re: adding "explicit commits" to Lucene?

Michael McCandless Mon, 15 Jan 2007 03:50:27 -0800

Chuck,

This seems to me to be a great idea, especially the ability to support
index transactions.


ParallelWriter (original implementation in LUCENE-600 -- I have a much
better one now) provides a companion writer to ParallelReader.  It takes
a Document, breaks it up into subdocuments associated with parallel
indexes that partition the fields, and writes those subdocuments into
their respective parallel indexes.  ParallelReader requires that the
parallel indexes remain doc-id synchronized, which severely limits the
opportunity for concurrent writing due to the possibility of the reader
reopening when the indexes are out of sync (more Documents in one than
another) and due to errors writing some subdocument(s) of a set when the
others succeed.

The new version of ParallelWriter, not in jira yet, provides more
concurrency and provides better error recovery than the version there
now, but it is still limited in possible concurrency and in the worst case
(when other recovery options fail) may have to fully optimize the
indexes to back out the case where only a subset of the subdocuments
derived from a given document fail to write.  The root cause for the
horrible error recovery case is the uncontrollable and irreversible
merging that may arise from adding a single document.

I believe what you propose would provide the foundation to fully solve
these problems efficiently, yielding much more concurrency and
guaranteeing efficient error recovery in ParallelWriter.  Also it would
simplify some other cases where transactional integrity is essential in
my current app.  So this really sounds great.


Neat!!  This sounds like a perfect fit: with explicit commits in the
index you should be able to greatly simplify ParallelWriter because
you're safe knowing readers would never open an "update in progress"
(ie a checkpoint segmentsx_N), and if you hit any error, you can
easily re-open your ParallelWriter against the last committed snapshot
(segments_N).  Ie your error recovery becomes trivial and correct.
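
To make the recovery idea concrete, here is a minimal sketch in plain Java.
It is a toy model, not Lucene code: ToyIndex, ToyParallelWriter and all of
their methods are hypothetical names invented for illustration.  It only
shows the shape of "readers see the last commit; on error, drop back to it".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for one index with explicit commits (hypothetical, not Lucene's API). */
final class ToyIndex {
    private List<String> committed = new ArrayList<>(); // the snapshot readers see
    private List<String> pending = new ArrayList<>();   // writer-only, uncommitted state

    void add(String subDoc) {
        // A null subdocument simulates a write error in this toy model.
        if (subDoc == null) {
            throw new IllegalStateException("simulated write failure");
        }
        pending.add(subDoc);
    }

    /** Publish everything pending as the new committed snapshot. */
    void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        committed = next;
        pending = new ArrayList<>();
    }

    /** Discard uncommitted work; the committed snapshot is untouched. */
    void rollback() {
        pending = new ArrayList<>();
    }

    /** Readers only ever get a copy of the committed snapshot. */
    List<String> openReader() {
        return new ArrayList<>(committed);
    }
}

/** Toy writer that keeps two parallel indexes doc-id synchronized. */
final class ToyParallelWriter {
    private final ToyIndex left = new ToyIndex();
    private final ToyIndex right = new ToyIndex();

    /** Returns true if both subdocuments were written and committed. */
    synchronized boolean addDocument(String leftFields, String rightFields) {
        try {
            left.add(leftFields);
            right.add(rightFields);
        } catch (IllegalStateException e) {
            // Error recovery: fall back to the last committed snapshot on both sides.
            left.rollback();
            right.rollback();
            return false;
        }
        left.commit();
        right.commit();
        return true;
    }

    synchronized int leftDocCount() { return left.openReader().size(); }

    synchronized int rightDocCount() { return right.openReader().size(); }
}
```

In this sketch a failed subdocument write leaves both toy indexes at the
same document count, so a reader never sees them out of sync.  Committing
after every document is only for brevity; a real writer would presumably
batch many documents per commit.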

I had not thought of this use case.  I think there are lots of
important use cases lurking out there that are enabled once we
have explicit commits.

Possibly related, one of the ways I improved concurrency in
ParallelWriter was to break up IndexWriter.addDocument() into one method
to invert the document and create a RAMSegment and a second method that
takes the RAMSegment and merges it into the index.  This allows
inversions to be processed in parallel, while merging is already a
critical section.  (Side thought:  I've been wondering how hard it would
be to make merging not a critical section).  I had thought of the method
to take the RAMSegment and merge it to be the "commit" part of
addDocument().
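
A rough sketch of that split, in plain Java rather than against the real
IndexWriter: RamSegment, SplitWriter, invert() and append() are hypothetical
names, and the "index" is just a term-frequency map.  The point is only that
inversion touches no shared state and can run on many threads, while the
merge step stays a critical section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical single-document in-memory segment: term -> frequency. */
final class RamSegment {
    final Map<String, Integer> termFreqs = new TreeMap<>();
}

/** Toy writer with addDocument() split into invert() and append(). */
final class SplitWriter {
    private final Map<String, Integer> index = new TreeMap<>();
    private int docCount = 0;

    /** Inversion reads only its argument, so it needs no lock. */
    static RamSegment invert(String text) {
        RamSegment seg = new RamSegment();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                seg.termFreqs.merge(term, 1, Integer::sum);
            }
        }
        return seg;
    }

    /** Merging into the shared index is the critical section. */
    synchronized void append(RamSegment seg) {
        seg.termFreqs.forEach((term, freq) -> index.merge(term, freq, Integer::sum));
        docCount++;
    }

    /** The existing single-call form simply chains the two steps. */
    void addDocument(String text) {
        append(invert(text));
    }

    synchronized int docCount() { return docCount; }

    synchronized int freq(String term) { return index.getOrDefault(term, 0); }
}

final class SplitWriterDemo {
    /** Inverts documents on a thread pool, then appends them in submission order. */
    static SplitWriter indexAll(List<String> docs, int threads) {
        SplitWriter writer = new SplitWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<RamSegment>> inverted = new ArrayList<>();
            for (String doc : docs) {
                inverted.add(pool.submit(() -> SplitWriter.invert(doc)));
            }
            for (Future<RamSegment> f : inverted) {
                writer.append(f.get()); // keeps doc order stable
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return writer;
    }
}
```

Appending in submission order keeps document order deterministic here,
which matters if parallel indexes must stay doc-id synchronized.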


Re side thought:

I think this may be another use case enabled by explicit commits: you
could imagine separate threads building up / merging their own private
set of segments and then merely adding them into the primary index.
What explicit commits can buy you is the fact that all these "private
segments" need not be made searchable until a commit() is called.  So
in-between commits there should be a lot of room for concurrency in
merging segments.

Your notion of commit is much better and more flexible, but perhaps you
could include this inversion/merge separation as well?


I'm a little confused on what this would mean?  Do you mean opening up
separate public methods: one to invert (and get a segment back) and
one to append (and possibly merge) a segment to the index (keeping the
existing addDocument that would then just call these two)?  How would
this buy you more concurrency (since the current method indeed only
synchronizes around the merge part)?  Oh: would you behind the scenes
take each "single doc" segment and pre-merge them privately,
concurrently, possibly up to many levels, and then finally
add the merged segment into the index?  Ie, the beginnings of
"concurrent merge" described above?
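
Here is how I picture that, as a toy sketch in plain Java.  None of these
names (Segment, SharedIndex, PrivateMergeDemo) exist in Lucene; they are
made up for illustration, and a "segment" is just a list of documents.
Each thread merges its own single-doc segments privately and only takes
the shared lock for the final add; nothing is searchable until commit().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical immutable segment: just a list of documents. */
final class Segment {
    final List<String> docs;

    Segment(List<String> docs) {
        this.docs = Collections.unmodifiableList(new ArrayList<>(docs));
    }

    /** Merges several segments into one; touches no shared state. */
    static Segment merge(List<Segment> parts) {
        List<String> all = new ArrayList<>();
        for (Segment s : parts) {
            all.addAll(s.docs);
        }
        return new Segment(all);
    }
}

/** Toy shared index: added segments stay invisible until commit(). */
final class SharedIndex {
    private final List<Segment> pending = new ArrayList<>();
    private List<Segment> searchable = new ArrayList<>();

    /** The only step that needs the shared lock. */
    synchronized void addSegment(Segment s) {
        pending.add(s);
    }

    synchronized void commit() {
        List<Segment> next = new ArrayList<>(searchable);
        next.addAll(pending);
        pending.clear();
        searchable = next;
    }

    synchronized int searchableSegmentCount() { return searchable.size(); }

    synchronized int searchableDocCount() {
        int n = 0;
        for (Segment s : searchable) {
            n += s.docs.size();
        }
        return n;
    }
}

final class PrivateMergeDemo {
    /** One thread per batch; segment order across threads is not deterministic. */
    static SharedIndex build(List<List<String>> batches) {
        SharedIndex index = new SharedIndex();
        List<Thread> workers = new ArrayList<>();
        for (List<String> batch : batches) {
            Thread t = new Thread(() -> {
                // Build single-doc segments and merge them privately, lock-free.
                List<Segment> privateSegs = new ArrayList<>();
                for (String doc : batch) {
                    privateSegs.add(new Segment(Collections.singletonList(doc)));
                }
                index.addSegment(Segment.merge(privateSegs));
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return index; // caller decides when to commit()
    }
}
```

Note this toy makes no attempt to keep doc ids ordered across threads,
which a ParallelWriter would still have to handle.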

Actually couldn't we do this change today (ie without waiting for
explicit commits)?  It seems like a separable change.

Mike
