Re: real time updates

Michael McCandless Sun, 15 Mar 2009 14:45:19 -0700


Marvin Humphrey wrote:

On Sat, Mar 14, 2009 at 05:51:43AM -0400, Michael McCandless wrote:
Even w/ background merging, which allows new segments to be written &
reopened in a reader even while the big merge is running in the BG,
Lucene still has the challenge of warming a reader on the [large]
newly merged segment before using the reader "for real".
Lucy doesn't have to worry about the warming aspect; given sufficient RAM,
all the files in the recently written segment will still be "hot" in the
OS file cache.
The trick we need to master is the coordination of two concurrent write
processes.  I think it goes something like this:

* The background consolidator writer grabs "consolidate.lock". It starts
  writing its own segment based on the state of the index at that moment.
* Meanwhile, an indeterminate number of consolidator-aware write processes
  launch and complete.


So eg you could merge 2 sets of segments at once (like Lucene)?

These processes are forbidden from merging any files
 that pre-date the establishment of "consolidate.lock".

Why? It seems like it needs to merge segments created before it acquired
that lock (that's why it was launched).

* Once the consolidator finishes most of what it's doing, it waits to
  obtain a write lock. The only task left is to carry forward new
  deletions which have been made since the establishment of
  "consolidate.lock" against the segments which the consolidator has just
  merged away. It finishes that task, commits, releases "write.lock",
  releases "consolidate.lock", then exits.

That, and update the master "segments" file to actually record the merge,
and incRef/decRef to delete files.
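
To make the three steps above concrete, here is a minimal single-threaded
sketch in Python. It is an illustration only, not Lucy's or Lucene's code:
the Segment, Index, Consolidator and small_write names are hypothetical, the
locks are modelled as names in an in-memory set rather than lock files on
disk, and the incRef/decRef file cleanup is not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    name: str
    docs: list                                  # doc ids stored in this segment
    deleted: set = field(default_factory=set)   # doc ids deleted so far


@dataclass
class Index:
    segments: dict = field(default_factory=dict)  # stands in for the master "segments" file
    locks: set = field(default_factory=set)       # names of locks currently held

    def acquire(self, lock):
        if lock in self.locks:
            raise RuntimeError(f"{lock} is already held")
        self.locks.add(lock)

    def release(self, lock):
        self.locks.remove(lock)


def small_write(index, name, docs, deletes=()):
    """Consolidator-aware writer: adds a segment and records deletions,
    but never merges segments that pre-date consolidate.lock."""
    index.acquire("write.lock")
    try:
        index.segments[name] = Segment(name, list(docs))
        for seg_name, doc in deletes:
            index.segments[seg_name].deleted.add(doc)
    finally:
        index.release("write.lock")


class Consolidator:
    def __init__(self, index, merged_name):
        self.index = index
        self.merged_name = merged_name

    def start(self):
        # Step 1: grab consolidate.lock and merge the index as it is right now.
        self.index.acquire("consolidate.lock")
        self.inputs = list(self.index.segments)
        self.seen = {n: set(self.index.segments[n].deleted) for n in self.inputs}
        live = [d for n in self.inputs
                for d in self.index.segments[n].docs if d not in self.seen[n]]
        self.merged = Segment(self.merged_name, live)

    def finish(self):
        # Step 3: hold write.lock only for the short final phase.
        self.index.acquire("write.lock")
        try:
            for n in self.inputs:
                # Carry forward deletions that arrived after the snapshot.
                self.merged.deleted |= self.index.segments[n].deleted - self.seen[n]
            for n in self.inputs:
                del self.index.segments[n]
            # Record the merge in the master segment listing.
            self.index.segments[self.merged.name] = self.merged
        finally:
            self.index.release("write.lock")
            self.index.release("consolidate.lock")


index = Index()
index.segments["seg_1"] = Segment("seg_1", [1, 2, 3])
index.segments["seg_2"] = Segment("seg_2", [4, 5])

consolidator = Consolidator(index, "seg_4")
consolidator.start()
# Step 2: a small writer launches and completes while the big merge runs.
small_write(index, "seg_3", [6], deletes=[("seg_1", 2)])
consolidator.finish()
```

After this runs, the index holds seg_3 and seg_4, and the deletion of doc 2,
which arrived mid-merge, has been carried forward onto seg_4.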

Does that sound similar to the Lucene implementation?


Yes.

But what if, while a large merge is happening, enough segments have
been written to warrant a small merge kicking off & finishing?

We need an incremental copy-on-write solution (eg only the "page" that's
changed gets copied when a new deletion arrives). We need this for changes
to norms too.
Norms, huh? That's weird. Do you have to do that because a field definition
has been modified?


No, it's to handle someone calling IndexReader.setNorm, eg if they are
doing "realtime boosting".
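
As a rough illustration of the paged copy-on-write idea for deletions, here
is a small Python sketch. It is not how Lucene or Lucy store deletions: the
PagedBitVector class, its with_deletion method and the PAGE_BITS page size
are all hypothetical names and values picked for the example. The same
scheme could plausibly be applied to a norms byte array, with whole bytes
rather than bits being updated.

```python
PAGE_BITS = 1024  # hypothetical page size, in bits


class PagedBitVector:
    """Deletions bit vector split into fixed-size pages.

    Pages are immutable bytes objects, so snapshots can share them safely.
    """

    def __init__(self, num_docs, pages=None):
        self.num_docs = num_docs
        num_pages = (num_docs + PAGE_BITS - 1) // PAGE_BITS
        if pages is None:
            pages = [bytes(PAGE_BITS // 8)] * num_pages
        self.pages = pages

    def get(self, doc_id):
        page, bit = divmod(doc_id, PAGE_BITS)
        return bool(self.pages[page][bit // 8] & (1 << (bit % 8)))

    def with_deletion(self, doc_id):
        """Return a new vector with doc_id deleted, copying only one page."""
        if not 0 <= doc_id < self.num_docs:
            raise IndexError(doc_id)
        page, bit = divmod(doc_id, PAGE_BITS)
        changed = bytearray(self.pages[page])   # copy just the affected page
        changed[bit // 8] |= 1 << (bit % 8)
        new_pages = list(self.pages)            # copies page references only
        new_pages[page] = bytes(changed)
        return PagedBitVector(self.num_docs, new_pages)


old = PagedBitVector(5000)       # 5 pages of 1024 bits
new = old.with_deletion(3000)    # doc 3000 lives on page 2
```

A reader still holding `old` keeps seeing doc 3000 as live, while `new` sees
it as deleted. Only page 2 was copied; the other four pages are shared
between the two snapshots.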

But then does deletions-seg_2.bv contain all deletes for seg_2? In which
case this is just like the "generation" Lucene increments & tacks on when
it saves a del; just a different naming scheme.
That's right, it's just a different naming scheme. In fact, it's marginally
less efficient because the bit vector must be copied a little more often.
However, with that change, segment directories are truly never modified once
written. For somewhat esoteric reasons, that made it easier to factor a
sensible DeletionsWriter out of the existing KinoSearch indexing code so that
we could plug in alternative implementations.


OK.

Mike
