Jason Rutherglen wrote:

One of the bottlenecks I have noticed while testing Ocean realtime search is the delete process, which involves writing several files for what may be a single document delete in SegmentReader. The best way to handle deletes is to simply keep them in memory without flushing them to disk, saving the cost of writing out an entire BitVector per delete. The deletes are saved in the transaction log, which is replayed on recovery.
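
Something like this is the shape I have in mind (just a sketch with made-up
names, using a plain BitSet and a simple append-only log file standing in
for the transaction log):

  import java.io.DataOutputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.util.BitSet;

  class BufferedDeletes {

    private final BitSet deletedDocs = new BitSet(); // in memory only, never written per delete
    private final DataOutputStream txLog;            // append-only log, replayed on recovery

    BufferedDeletes(String txLogPath) throws IOException {
      this.txLog = new DataOutputStream(new FileOutputStream(txLogPath, true));
    }

    synchronized void delete(int docID) throws IOException {
      txLog.writeInt(docID);  // cheap sequential append instead of rewriting a BitVector file
      txLog.flush();
      deletedDocs.set(docID); // immediately visible to searches
    }

    synchronized boolean isDeleted(int docID) {
      return deletedDocs.get(docID);
    }
  }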

I am not sure of the best way to approach this; perhaps it is to create a custom class that inherits from SegmentReader. It could reuse the existing reopen logic and also provide a way to set the deletedDocs BitVector. It could also reuse FieldsReader by providing locking around a single FieldsReader shared by all SegmentReaders of the segment. In the current architecture each new SegmentReader opens its own FieldsReader, which is suboptimal. The deletes would still be saved to disk, but periodically, like a checkpoint, rather than once per delete.
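
Roughly like this (the names and signatures here are invented, and I'm
stubbing FieldsReader so the sketch stands alone; the real classes
obviously differ):

  import java.io.IOException;
  import java.util.BitSet;

  // Stand-in for the real FieldsReader, just to keep the sketch self-contained.
  interface FieldsReaderLike {
    Object doc(int docID) throws IOException;
  }

  class RealtimeSegmentReader {

    private final FieldsReaderLike fieldsReader; // one shared instance for the whole segment
    private volatile BitSet deletedDocs;         // settable in memory, checkpointed to disk later

    RealtimeSegmentReader(FieldsReaderLike sharedFieldsReader, BitSet deletes) {
      this.fieldsReader = sharedFieldsReader;
      this.deletedDocs = deletes;
    }

    // reopen-style: the new reader shares the FieldsReader and just swaps in new deletes
    RealtimeSegmentReader reopen(BitSet newDeletes) {
      return new RealtimeSegmentReader(fieldsReader, newDeletes);
    }

    boolean isDeleted(int docID) {
      return deletedDocs.get(docID);
    }

    Object document(int docID) throws IOException {
      // FieldsReader is not safe for concurrent use, so every reader of the
      // segment serializes on the one shared instance instead of opening its own
      synchronized (fieldsReader) {
        return fieldsReader.doc(docID);
      }
    }
  }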

Or ... maybe you could do the deletes through IndexWriter (somehow, if we can obtain the docIDs properly), and then SegmentReaders could tap into the buffered deleted docIDs that IndexWriter already maintains. IndexWriter is already doing this buffering, and the flush/commit, anyway.
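
Something along these lines, conceptually (IndexWriter doesn't actually
expose its buffered deletes this way, so these hooks are made up):

  import java.util.Collections;
  import java.util.HashSet;
  import java.util.Set;

  class WriterBackedDeletes {

    // docIDs the writer has buffered but not yet flushed to the index
    private final Set<Integer> pendingDeletes =
        Collections.synchronizedSet(new HashSet<Integer>());

    // would be called whenever the writer buffers a delete
    void bufferDelete(int docID) {
      pendingDeletes.add(docID);
    }

    // a SegmentReader would consult this in addition to its on-disk deletes
    boolean isDeleted(int docID, boolean deletedOnDisk) {
      return deletedOnDisk || pendingDeletes.contains(docID);
    }

    // flush/commit writes the deletes for real and clears the buffer
    void onFlush() {
      pendingDeletes.clear();
    }
  }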

We've also discussed, at one point, creating an IndexReader impl that searches the RAM buffer that DocumentsWriter writes to when adding documents. I think it's easier than it sounds at first glance, because DocumentsWriter is in fact writing the postings in nearly the same format as is used when the segment is flushed.
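
As a toy illustration of that enumeration idea (the in-memory layout here
is invented and far simpler than DocumentsWriter's actual format, but the
reader-over-a-RAM-buffer shape is the same):

  import java.util.HashMap;
  import java.util.Map;

  class RAMPostingsReader {

    // term -> docIDs, standing in for DocumentsWriter's per-term postings lists
    private final Map<String, int[]> postings = new HashMap<String, int[]>();

    void addPosting(String term, int[] docIDs) {
      postings.put(term, docIDs);
    }

    // TermDocs-style cursor over the buffered postings; no Directory in sight
    static class RAMTermDocs {
      private final int[] docs;
      private int upto = -1;

      RAMTermDocs(int[] docs) {
        this.docs = docs == null ? new int[0] : docs;
      }

      boolean next() {
        return ++upto < docs.length;
      }

      int doc() {
        return docs[upto];
      }
    }

    RAMTermDocs termDocs(String term) {
      return new RAMTermDocs(postings.get(term));
    }
  }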

So if we had this IndexReader impl, plus an extended SegmentReader that could tap into the pending deletes buffered in IndexWriter, you could get realtime search without having to use Directory as an intermediary. Though it is clearly quite a bit more work :)

Mike
