I don't think these changes are going to work. With multiple writers and/or readers doing deletes, you will get inconsistencies unless the writes are serialized - and the .del files will need to be unioned.

That is:

station A opens the index
station B opens the index
station A deletes some documents creating segment.del1
station B deletes some documents creating segment.del2

When station C opens the index (or when the segment is merged), del1 and del2 need to be unioned.

The locking enforces that writers are serialized - you cannot remove this restriction unless you merge the writes when reading.
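
Concretely, the merge-on-read has to be a union of the deletion bit vectors.  A minimal sketch, using java.util.BitSet as a stand-in for the on-disk .del bit vector (the names here are made up):

    import java.util.BitSet;

    class DelUnion {
        // A document must stay deleted if *either* station deleted it,
        // so segment.del1 and segment.del2 have to be OR'd together --
        // letting one file simply "win" would silently resurrect the
        // other station's deletes.
        static BitSet union(BitSet del1, BitSet del2) {
            BitSet merged = (BitSet) del1.clone();
            merged.or(del2); // bitwise OR == set union of deleted docs
            return merged;
        }
    }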


On Aug 18, 2006, at 1:41 PM, Michael McCandless wrote:


> It could in theory lead to starvation, but this should be rare in
> practice unless you have an IndexWriter that's constantly committing.
> An index with a small mergeFactor (say 2) and a small maxBufferedDocs
> (default 10) would have segments deleted every
> mergeFactor*maxBufferedDocs added documents when rapidly adding
> documents.  It might help to start opening segments with the *last*
> segment, where segment deletions are most likely to happen.

That is true.  I like the idea of opening the last segment first -- I'll
do that.
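
Roughly like this -- sketch only, with segmentInfos and
SegmentReader.open standing in for whatever the real code ends up using:

    SegmentReader[] readers = new SegmentReader[segmentInfos.size()];
    // Open the newest segment first: it is the one a concurrent commit
    // is most likely to have just deleted, so a reader fails fast and
    // can retry against a fresh segments file, instead of failing only
    // after having opened everything else.
    for (int i = segmentInfos.size() - 1; i >= 0; i--) {
        readers[i] = SegmentReader.open(segmentInfos.info(i));
    }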

> Also, when loading a .del file, how would one tell if it didn't exist
> or if it was just deleted?
> I guess one would always need to write a .del file even if no docs
> were deleted.  Or, one could just order the deletes (delete optional
> files in a segment last).

Right, in order to handle this, I've modified the segments file to
also contain the current "generation" (the .N suffix) of each
segment's .del and norms files.  This way, when SegmentReader reads
the segment, it knows exactly which del/norms files it's supposed to
find.  For "doUndeleteAll()" I write a zero-length .del.N+1 file.
SegmentReader already writes a new segments file when it commits (in
today's code).
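
The filename lookup itself then becomes deterministic; an illustrative
sketch (the -1 sentinel for "no deletions yet" is just my assumption
here):

    class DelGen {
        // Derive a segment's deletions filename from the generation
        // recorded for it in the segments file.  A generation of -1
        // (hypothetical sentinel) means the segment has never had
        // deletions, so SegmentReader knows not to look for a .del file
        // at all, rather than guessing whether a missing file was never
        // written or was just deleted.
        static String delFileName(String segment, long gen) {
            return gen == -1 ? null : segment + ".del." + gen;
        }
    }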

> One would also have to worry about partially deleted segments on
> Windows... while removing a segment, some of the files might fail to
> delete (due to still being open) and some might succeed.

Yes, I think this case is handled correctly.  Once all searchers using
those old segments are closed, then the next commit that runs will
remove those files (just like it does today).
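
In other words, each commit just retries whatever could not be removed
earlier.  A minimal sketch (the pending list of failed deletes is
hypothetical):

    import java.io.File;
    import java.util.Iterator;
    import java.util.List;

    class DeferredDeletes {
        // At each commit, retry files that Windows refused to delete
        // earlier because a reader still had them open.  Anything still
        // open simply stays on the list until a later commit succeeds.
        static void retryPending(List pending) {
            for (Iterator it = pending.iterator(); it.hasNext();) {
                File f = (File) it.next();
                if (!f.exists() || f.delete()) {
                    it.remove(); // file is gone; stop tracking it
                }
            }
        }
    }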

> Not having to read/write the deletable file should make things more
> robust (there was a thread recently on the users list about hitting an
> exception because deletable.new couldn't be deleted on Windows).

> This idea is worth kicking around more for the future (maybe for when
> the index format changes again), but it's probably too much change for
> right now (Lucene 2.0.x), right?

Yes, I don't think this should go in for a 2.0.x point release.  Maybe
for a 2.1.x?  Or I guess whenever we next have a major enough release
to allow changing the index format.

I do think the benefits are sizable, though, so we should not wait too
long :)  The number of poor people who post to the users list with
errant Access Denied, FileNotFound, "lock obtain timed out", etc.
exceptions is quite large.  There was just one today that I'm going to
go try to respond to next.  Plus the prospect of working just fine on
remote filesystems is great!

OK I will keep working through this & running stress tests on it to
see if I can uncover any issues...

Mike
