Jason Rutherglen wrote:
One of the bottlenecks I have noticed testing Ocean realtime search
is the delete process, which involves writing out several files for
what may be a single document delete in SegmentReader. The best way
to handle the deletes is to simply keep them in memory without
flushing them to disk, saving the cost of writing out an entire
BitVector per delete. The deletes are saved in the transaction log,
which is replayed on recovery.
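Roughly what I have in mind, as a sketch only (TransactionLog below
is a made-up stand-in for Ocean's log, not a Lucene class, and this
assumes package access to Lucene's internal BitVector):

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.BitVector;

    // Stand-in for Ocean's transaction log; not a Lucene class.
    interface TransactionLog {
      void append(int docID) throws IOException;
    }

    // Buffer deletes in memory, log each one for crash recovery,
    // and only write the full BitVector at checkpoints.
    class BufferedDeletes {
      private final BitVector deletedDocs;  // in-memory only
      private final TransactionLog log;

      BufferedDeletes(int maxDoc, TransactionLog log) {
        this.deletedDocs = new BitVector(maxDoc);
        this.log = log;
      }

      synchronized void delete(int docID) throws IOException {
        log.append(docID);       // durable record, replayed on recovery
        deletedDocs.set(docID);  // visible to searches immediately
      }

      // Called periodically, not per delete.
      synchronized void checkpoint(Directory dir, String delFileName)
          throws IOException {
        deletedDocs.write(dir, delFileName);
      }
    }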
I am not sure of the best way to approach this; perhaps it is to
create a custom class that inherits from SegmentReader. It could
reuse the existing reopen and also provide a way to set the
deletedDocs BitVector. It could also share a single FieldsReader,
with locking around it, among all SegmentReaders of the segment;
in the current architecture each new SegmentReader opens its own
FieldsReader, which is suboptimal. The deletes would still be saved
to disk, but periodically, like a checkpoint, rather than per delete.
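In code form it might look something like this (a sketch only; it
assumes the class lives in org.apache.lucene.index so that the
package-private SegmentReader, FieldsReader, and deletedDocs field
are reachable, and the constructor plumbing is omitted):

    package org.apache.lucene.index;

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.util.BitVector;

    // Sketch: a SegmentReader that accepts an in-memory deletedDocs
    // BitVector and shares one FieldsReader across all readers of
    // the segment instead of opening a new one per reader.
    class RealtimeSegmentReader extends SegmentReader {
      private FieldsReader sharedFieldsReader;  // one per segment

      // Swap in the buffered deletes without writing anything to disk.
      synchronized void setDeletedDocs(BitVector deletes) {
        this.deletedDocs = deletes;
      }

      // FieldsReader is not thread safe, so all SegmentReaders of
      // the segment funnel stored-field loads through this lock.
      Document sharedDoc(int docID) throws IOException {
        synchronized (sharedFieldsReader) {
          return sharedFieldsReader.doc(docID, null);
        }
      }
    }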
Or ... maybe you could do the deletes through IndexWriter (somehow, if
we can get docIDs properly) and then SegmentReaders could somehow tap
into the buffered deleted docIDs that IndexWriter already maintains.
IndexWriter is already doing this buffering and flush/commit anyway.
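Something like this, maybe (very hand-wavy: neither method below
exists today, and IndexWriter actually buffers deleted Terms rather
than docIDs, only resolving them to docIDs at flush):

    // Hypothetical glue: pull IndexWriter's buffered deletes for one
    // segment into a SegmentReader's in-memory deleted docs, skipping
    // the .del file round trip through the Directory.
    void syncDeletes(IndexWriter writer, SegmentReader reader,
                     String segmentName) throws IOException {
      // Hypothetical accessor: the docIDs IndexWriter has buffered
      // as deleted against this segment.
      int[] pending = writer.getBufferedDeletedDocIDs(segmentName);
      for (int i = 0; i < pending.length; i++) {
        // Hypothetical: mark the doc deleted in RAM only, no flush.
        reader.markDeletedInRAM(pending[i]);
      }
    }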
We've also discussed at one point creating an IndexReader impl that
searches the RAM buffer that DocumentsWriter writes to when adding
documents. I think it's easier than it sounds at first glance,
because DocumentsWriter is in fact writing the postings in nearly the
same format as is used when the segment is flushed.
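The shape of it might be roughly this (everything here is
hypothetical: DocumentsWriter has no such accessors today, and the
rest of IndexReader's abstract methods are left out of the sketch):

    package org.apache.lucene.index;

    import java.io.IOException;

    // Hypothetical IndexReader over DocumentsWriter's RAM buffer.
    // Because the in-RAM postings are in nearly the on-disk format,
    // a TermDocs implementation over them should be fairly direct.
    abstract class RAMBufferReader extends IndexReader {
      private final DocumentsWriter dw;

      RAMBufferReader(DocumentsWriter dw) {
        this.dw = dw;
      }

      public int maxDoc() {
        return dw.getNumDocsInRAM();  // hypothetical accessor
      }

      public TermDocs termDocs() throws IOException {
        return dw.ramTermDocs();  // hypothetical: walk in-RAM postings
      }

      // numDocs(), document(), norms(), etc. left abstract here.
    }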
So if we had this IndexReader impl, plus extended SegmentReader so it
could tap into pending deletes buffered in IndexWriter, you could get
realtime search without having to use Directory as an intermediary.
Though, it is clearly quite a bit more work :)
Mike