The basic idea is to change all commits (from SegmentReader or IndexWriter) so that we never write to an existing file that a reader could be reading from. Instead, we always write to a new, sequentially numbered file name. For example, for "segments", on every commit we write to the sequence: segments.1, segments.2, segments.3, etc. Likewise for the *.del and *.fN (norms) files that SegmentReaders write to.
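A rough sketch of how a writer could pick the next sequentially numbered segments file — the helper name and scanning logic here are mine for illustration, not Lucene's actual code:

```java
public class SegmentsGen {
    // Hypothetical helper: find the highest generation among files named
    // "segments.N" in a directory listing and return the name the next
    // commit should write to.
    static String nextSegmentsFile(String[] dirListing) {
        long maxGen = 0;
        for (String name : dirListing) {
            if (name.startsWith("segments.")) {
                try {
                    long gen = Long.parseLong(name.substring("segments.".length()));
                    if (gen > maxGen) maxGen = gen;
                } catch (NumberFormatException e) {
                    // ignore non-numeric suffixes
                }
            }
        }
        return "segments." + (maxGen + 1);
    }

    public static void main(String[] args) {
        String[] dir = { "segments.1", "segments.2", "_0.cfs" };
        System.out.println(nextSegmentsFile(dir)); // prints "segments.3"
    }
}
```

Because the new name never existed before, a reader can never see a partially overwritten file — at worst it sees the previous, complete generation.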
Interesting idea... How do you get around races between opening and deleting? I assume the writer would 1) write the new segment files, 2) write a new 'segments.3', 3) delete the unused segment files (those referenced only by 'segments.2'). But what happens when a reader comes along at point 1.5, say, opens the latest 'segments.2' file, and then tries to open some of the segment files at point 3.5? I guess the reader could retry, checking for a new segments file. This could happen more than once (hopefully it wouldn't lead to starvation... that would be unlikely).
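The reader-side retry could look roughly like this sketch. All method names are hypothetical stand-ins, and the "directory" is simulated so the race (a commit deleting old files mid-open) can be shown deterministically:

```java
import java.io.FileNotFoundException;

public class RetryOpen {
    static int calls = 0;

    // Stub standing in for loading the segment files a segments.N refers to.
    // Fails the first time, simulating a writer's commit deleting old files
    // between our directory scan and our open.
    static String loadSegments(String segmentsFile) throws FileNotFoundException {
        calls++;
        if (calls == 1) throw new FileNotFoundException("_0.cfs");
        return "opened " + segmentsFile;
    }

    // Stand-in for scanning the directory for the newest segments.N; the
    // simulated writer commits segments.3 while our first open is in flight.
    static String latestSegmentsFile() {
        return calls == 0 ? "segments.2" : "segments.3";
    }

    static String open() throws FileNotFoundException {
        while (true) {
            String seg = latestSegmentsFile();
            try {
                return loadSegments(seg);
            } catch (FileNotFoundException e) {
                // Files referenced by this segments.N were deleted. If a newer
                // segments file has appeared, retry against it; otherwise the
                // missing file is a real error, so rethrow.
                if (latestSegmentsFile().equals(seg)) throw e;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(open()); // prints "opened segments.3"
    }
}
```

The key point is that a FileNotFound during open is only treated as "retry" when a newer segments file exists — which is exactly the evidence that a commit raced with us.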
We can also get rid of the "deletable" file (and the associated errors renaming deletable.new -> deletable) because we can compute what's deletable as "whatever is not referenced by the current segments file."
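That computation is just a set difference. A sketch, with invented file names and a hypothetical helper (the real code would read the referenced names out of the segments file):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class Deletables {
    // "Deletable" = everything in the directory that is neither the current
    // segments.N file itself nor referenced by it.
    static Set<String> deletable(Set<String> dirListing, String currentSegments,
                                 Set<String> referenced) {
        Set<String> dead = new TreeSet<>(dirListing); // sorted for stable output
        dead.remove(currentSegments);                 // keep the commit point itself
        dead.removeAll(referenced);                   // keep the files it references
        return dead;
    }

    public static void main(String[] args) {
        Set<String> dir = new HashSet<>(Arrays.asList(
                "segments.2", "segments.3", "_0.cfs", "_1.cfs"));
        // Suppose the newest commit, segments.3, references only _1.cfs:
        Set<String> referenced = new HashSet<>(Arrays.asList("_1.cfs"));
        System.out.println(deletable(dir, "segments.3", referenced));
        // prints [_0.cfs, segments.2]: the prior commit point and its segment
    }
}
```

No on-disk bookkeeping is needed; the directory listing plus the current segments file fully determine what can go.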
If the segments file is written last, how does an asynchronous deleter tell what will be part of a future index? I guess it's doable if all file types have sequence numbers...

-Yonik
http://incubator.apache.org/solr
Solr, the open-source Lucene search server

On 8/18/06, Michael McCandless <[EMAIL PROTECTED]> wrote:
I think it's possible to modify Lucene's commit process so that it does not require any commit locking at all. This would be a big win because it would prevent all the various messy errors (FileNotFound exceptions on instantiating an IndexReader, Access Denied errors on renaming X.new -> X, Lock obtain timed out from leftover lock files, etc.) that Lucene users keep coming across. Also, indices on remote (NFS, Samba) filesystems, where the current locking has known issues that users seem to hit fairly often, would then be fine.

I'd like to get feedback on this idea (am I missing something?), and if there are no objections I can submit a full patch. I have an initial implementation that passes all unit tests. It also runs fine with a writer/searcher stress test: the writer adding docs to an index stored on NFS, and a multi-threaded reader on a separate machine (Windows XP, mounted over Samba) continuously re-instantiating an IndexSearcher and doing a search against the same index.
Disk usage should be the same, even temporarily when merging, because we still remove the old segments after merging.
This means IndexReader, on opening an index, finds the most recent segments file and loads it. If, while loading the segments, it hits a FileNotFound exception and a newer segments file has appeared, it retries against the new one.

This does entail small changes to the index file format. Specifically, file names are different (they have new .N suffixes), and the contents of the segments file are expanded to contain details about which del/norm files are current for each segment.

Note that the write lock is still needed to catch people accidentally creating two writers on one index. But since this lock file isn't obtained/released as frequently as the current commit lock, I would expect fewer issues from it.

This change should be fully backwards compatible: the new code would read the old index format, and I believe existing APIs should not change. But if there are applications (maybe Solr?) that peek inside the index files expecting (for example) a file named "segments" to be there, then such cases would need to be fixed.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]