[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662214#action_12662214 ]
Jason Rutherglen commented on LUCENE-1476: ------------------------------------------ M.M.:" I think the transactions layer would also sit on top of this "realtime" layer? EG this "realtime" layer would expose a commit() method, and the transaction layer above it would maintain the transaction log, periodically calling commit() and truncating the transaction log?" One approach that may be optimal is to expose from IndexWriter a createTransaction method that accepts new documents and deletes. All documents have an associated UID. The new documents could feasibly be encoded into a single segment that represents the added documents for that transaction. The deletes would be represented as document long UIDs rather than int doc ids. Then the commit method would be called on the transaction object who returns a reader representing the latest version of the index, plus the changes created by the transaction. This system would be a part of IndexWriter and would not rely on a transaction log. IndexWriter.commit would flush the in ram realtime indexes to disk. The realtime merge policy would flush based on the RAM usage or number of docs. {code} IndexWriter iw = new IndexWriter(); Transaction tr = iw.createTransaction(); tr.addDocument(new Document()); tr.addDocument(new Document()); tr.deleteDocument(1200l); IndexReader ir = tr.flush(); // flushes transaction to the index (probably to a ram index) IndexReader latestReader = iw.getReader(); // same as ir iw.commit(boolean doWait); // commits the in ram realtime index to disk {code} When commit is called, the disk segment reader flush their deletes to disk which is fast. The in ram realtime index is merged to disk. The process is described in more detail further down. M.H.: "how about writing a single-file Directory implementation?" I'm not sure we need this because and appending rolling transaction log should work. Segments don't change, only things like norms and deletes which can be appended to a rolling transaction log file system. If we had a generic transaction logging system, the future column stride fields, deletes, norms, and future realtime features could use it and be realtime. M.H.: "How do you guarantee that you always see the "current" version of a given document, and only that version? Each transaction returns an IndexReader. Each "row" or "object" could use a unique id in the transaction log model which would allow documents that were merged into other segments to be deleted during a transaction log replay. M.H.: "When do you expose new deletes in the RAMDirectory, when do you expose new deletes in the FSDirectory" When do you expose new deletes in the RAMDir, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDir to the FSDirectory, how do you manage new adds to the RAMDir that take place during slow merges..." Queue deletes to the RAMDir, while copying the RAMDir to the FSDir in the background, perform the deletes after the copy is completed, then instantiate a new reader with the newly merged FSDirectory and a new RAMDirs. Writes that were occurring during this process would be happening to another new RAMDir. One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM. Each transaction is an instance of some segments and their deletes (and we're not worried about the deletes being flushed or not so assume they exist as BitVectors). The system should expose an API to checkpoint/flush at a given transaction level (usually the current) and should not stop new updates from happening. When I wrote this type of system, I managed individual segments outside of IndexWriter's merge policy and performed the merging manually by placing each segment in it's own FSDirectory (the segment size was 64MB) which minimized the number of directories. I do not know the best approach for this when performed within IndexWriter. M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS when it wants to actually perform the actual disk i/o, how expensive is flushing? Can we make it cheap enough to meet Jason's absolute change rate requirements?" When I tried out the transaction log a write usually mapped pretty quickly to a hard disk write. I don't think it's safe to leave writes up to the OS. M.M.: "maintain & updated deleted docs even though IndexWriter has the write lock" In my previous realtime search implementation I got around this by having each segment in it's own directory. Assuming this is non-optimal, we will need to expose an IndexReader that has the writelock of the IndexWriter. > BitVector implement DocIdSet > ---------------------------- > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Affects Versions: 2.4 > Reporter: Jason Rutherglen > Priority: Trivial > Attachments: LUCENE-1476.patch > > Original Estimate: 12h > Remaining Estimate: 12h > > BitVector can implement DocIdSet. This is for making > SegmentReader.deletedDocs pluggable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org