The basic idea is to change all commits (from SegmentReader or
IndexWriter) so that we never write to an existing file that a reader
could be reading from. Instead, always write to a new file name using
sequentially numbered files. For example, for "segments", on every
commit, write to the sequence: segments.1, segments.2, segments.3,
etc. Likewise for the *.del and *.fN (norms) files that
SegmentReaders write to.
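The generation-numbered naming could be sketched like this (a minimal illustration; the helper names are hypothetical, not Lucene's actual API):

```java
class SegmentsFileNames {
    // Hypothetical helper: each commit writes a new, sequentially
    // numbered segments file, so no existing file is ever overwritten.
    static String segmentsFileName(int generation) {
        return "segments." + generation;
    }

    static String fileNameAfterCommit(int currentGeneration) {
        // A commit never rewrites segments.N; it writes segments.(N+1).
        return segmentsFileName(currentGeneration + 1);
    }
}
```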
Interesting idea...
How do you get around races between opening and deleting?
I assume for the writer, you would
1) write the new segment files
2) write new 'segments.3'
3) delete unused segments (those referenced by 'segments.2')
But what happens when a reader comes along at point 1.5, say, opens
the latest 'segments.2' file, and then at point 3.5 tries to open some
of the segment files it references?
I guess the reader could retry... checking for a new segments file.
This could happen more than once (hopefully it wouldn't lead to
starvation... that would be unlikely).
Yes, exactly.
And specifically, the reader only retries if, on hitting a FileNotFound
exception, it then checks & sees that a newer segments file is
available. This way if there is a "true" FileNotFound exception due to
some sort of index corruption or something, we will [correctly] throw it.
It could in theory lead to starvation but this should be rare in
practice unless you have an IndexWriter that's constantly committing.
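That retry rule could be sketched roughly like this (illustrative only: tryOpen and the file names are made up to simulate opening a commit against a directory listing):

```java
import java.io.FileNotFoundException;
import java.util.Set;

class RetryOnStaleCommit {
    // Simulated open: succeeds only if the segments file and the one
    // segment file it "references" are both still present.
    static int tryOpen(Set<String> dir, int gen) throws FileNotFoundException {
        if (!dir.contains("segments." + gen) || !dir.contains("_seg" + gen + ".cfs")) {
            throw new FileNotFoundException("generation " + gen);
        }
        return gen;
    }

    static int openNewestCommit(Set<String> dir, int startGen)
            throws FileNotFoundException {
        int gen = startGen;
        while (true) {
            try {
                return tryOpen(dir, gen);
            } catch (FileNotFoundException e) {
                // Retry only when a newer segments file exists; otherwise the
                // missing file indicates real corruption, so rethrow.
                if (dir.contains("segments." + (gen + 1))) {
                    gen++;
                } else {
                    throw e;
                }
            }
        }
    }
}
```

So a reader that opened a stale commit quietly advances to the newer one, while a genuinely missing file still surfaces as an exception.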
Also note that this should be no worse than what we have today, where
you would also likely hit starvation and get a "Lock obtain timed out"
thrown (eg see http://issues.apache.org/jira/browse/LUCENE-307).
In my stress test (a shared index with the writer accessing it over NFS
and three reader threads repeatedly doing "open IndexSearcher; search"
over a Samba share) the IndexSearchers do retry, but so far never more
than once. Of course this will depend heavily on the details of the use
case ...
We can also get rid of the "deletable" file (and the associated errors
renaming deletable.new -> deletable) because we can compute what's
deletable as "whatever is not referenced by the current segments
file."
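That computation could be sketched as simple set difference (hypothetical names; in reality the referenced set would be read out of the segments file itself):

```java
import java.util.HashSet;
import java.util.Set;

class DeletableFiles {
    // Hypothetical helper: anything the current segments file does not
    // reference (and that is not the segments file itself) is deletable,
    // so no separate "deletable" file needs to be maintained on disk.
    static Set<String> deletable(Set<String> allFiles,
                                 Set<String> referenced,
                                 String currentSegmentsFile) {
        Set<String> dead = new HashSet<>(allFiles);
        dead.removeAll(referenced);
        dead.remove(currentSegmentsFile);
        return dead;
    }
}
```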
If the segments file is written last, how does an asynchronous deleter
tell what will be part of a future index? I guess it's doable if all
file types have sequence numbers...
Well, in my current implementation I don't have a truly asynchronous
deleter. If I did have that then you're right I'd need to not delete
the "new and in progress" files. We could consider something like that
in the future ...
Instead, I still do all deletes [synchronously] in the same places as
the current code, with the write lock held. For example, during a
commit, we delete old segments immediately after writing the new
segments file, and then again after creating a compound file (if index
is using compound files). Likewise when a SegmentReader commits new
deletes/norms.
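That ordering could be sketched with a toy that only records the sequence of steps (none of these names are Lucene's real API):

```java
import java.util.ArrayList;
import java.util.List;

class CommitOrderSketch {
    // Records the order of commit-time operations, for illustration only.
    final List<String> log = new ArrayList<>();
    int generation;
    final boolean useCompoundFile;

    CommitOrderSketch(int generation, boolean useCompoundFile) {
        this.generation = generation;
        this.useCompoundFile = useCompoundFile;
    }

    void commit() {
        generation++;
        log.add("write segments." + generation);  // new commit point first
        log.add("delete old files");              // synchronous, write lock held
        if (useCompoundFile) {
            log.add("create compound file");
            log.add("delete old files");          // per-segment files now unreferenced
        }
    }
}
```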
Also, one neat possibility this could enable in the future is
explicitly keeping "virtual snapshots" at points in time, but within a
single index (vs. e.g. the hard-link snapshots that Solr does).
For example, if you want to index a bunch of docs but not yet make them
visible for searching, with the current code you have to make sure
never to restart an IndexSearcher. If your app server goes down (say),
all IndexSearchers will come back up and make your indexed docs
visible.
But with this new approach (plus some additional code that I'm not
planning on doing for starters), it would be possible for an
IndexSearcher to explicitly say "I'd like to re-open the snapshot of the
index as of 3 days ago", for example. This would require more smarts in
the reclaiming of old files ... but at least this could be a first step
towards that.
Mike