[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662214#action_12662214 ]

Jason Rutherglen commented on LUCENE-1476:
------------------------------------------

M.M.:" I think the transactions layer would also sit on top of this
"realtime" layer? EG this "realtime" layer would expose a commit()
method, and the transaction layer above it would maintain the
transaction log, periodically calling commit() and truncating the
transaction log?"

One approach that may be optimal is to expose from IndexWriter a 
createTransaction method that accepts new documents and deletes.  All 
documents have an associated UID.  The new documents could feasibly be 
encoded into a single segment that represents the documents added in that 
transaction.  The deletes would be represented as long document UIDs rather 
than int doc ids.  The commit method would then be called on the transaction 
object, which returns a reader representing the latest version of the index 
plus the changes created by the transaction.  This system would be part of 
IndexWriter and would not rely on a transaction log.  IndexWriter.commit 
would flush the in-RAM realtime indexes to disk.  The realtime merge policy 
would flush based on RAM usage or number of docs.

{code}
IndexWriter iw = new IndexWriter();
Transaction tr = iw.createTransaction();
tr.addDocument(new Document());
tr.addDocument(new Document());
tr.deleteDocument(1200L); // delete by long UID, not int doc id
IndexReader ir = tr.flush(); // flushes the transaction to the index
                             // (probably to a RAM index)
IndexReader latestReader = iw.getReader(); // same as ir
iw.commit(true); // doWait=true; commits the in-RAM realtime index to disk
{code}

When commit is called, the disk segment readers flush their deletes to disk, 
which is fast.  The in-RAM realtime index is then merged to disk.  The 
process is described in more detail further down.
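
A rough sketch of that commit path using existing Lucene APIs (the helper 
method, and the assumption that the realtime index lives in its own 
RAMDirectory, are mine, not part of IndexWriter):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Sketch only: merge the in-RAM realtime index onto disk at commit time.
class RealtimeCommit {
  void commitRealtime(IndexWriter diskWriter, RAMDirectory ramDir)
      throws IOException {
    // Merge the RAM segments onto disk in one pass.
    diskWriter.addIndexesNoOptimize(new Directory[] { ramDir });
    // commit() flushes the buffered deletes (fast: only the .del bit
    // vectors are rewritten) and makes the new segments durable.
    diskWriter.commit();
  }
}
{code}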

M.H.: "how about writing a single-file Directory implementation?"

I'm not sure we need this, because an appending rolling transaction log 
should work.  Segments don't change; only things like norms and deletes do, 
and those can be appended to a rolling transaction log.  If we had a generic 
transaction logging system, then future column-stride fields, deletes, 
norms, and other future realtime features could use it and be realtime.
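
As a rough illustration of such a generic log (the record format, file 
naming, and class names here are invented for the sketch):

{code}
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative append-only rolling log.  Each record is an opcode plus a
// payload; deletes log the document's long UID.  Norm and column-stride
// updates would get their own opcodes.
class RollingTransactionLog {
  static final byte OP_DELETE = 1;

  private final File dir;
  private final long rollAtBytes;
  private File current;
  private DataOutputStream out;
  private int generation;

  RollingTransactionLog(File dir, long rollAtBytes) throws IOException {
    this.dir = dir;
    this.rollAtBytes = rollAtBytes;
    roll();
  }

  synchronized void appendDelete(long uid) throws IOException {
    out.writeByte(OP_DELETE);
    out.writeLong(uid);
    out.flush();
    if (current.length() >= rollAtBytes) {
      roll(); // segments never change, so old generations can be truncated
    }
  }

  private void roll() throws IOException {
    if (out != null) {
      out.close();
    }
    current = new File(dir, "translog." + generation++);
    out = new DataOutputStream(new FileOutputStream(current, true));
  }
}
{code}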

M.H.: "How do you guarantee that you always see the "current" version of a 
given document, and only that version? 

Each transaction returns an IndexReader.  Each "row" or "object" could use a 
unique ID in the transaction log model, which would allow documents that 
were merged into other segments to be deleted during a transaction log 
replay.
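
For example, if every document carries an indexed "uid" field, a logged 
delete can be replayed with the existing delete-by-Term API and will still 
apply after the document has been merged into another segment (the field 
name and the plain string encoding are assumptions):

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch: replaying a logged delete by UID.  Keying the delete on an
// indexed "uid" field rather than an int doc id makes it stable across
// segment merges.
class LogReplay {
  void replayDelete(IndexWriter writer, long uid) throws IOException {
    writer.deleteDocuments(new Term("uid", Long.toString(uid)));
  }
}
{code}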

M.H.: "When do you expose new deletes in the RAMDirectory, when do you expose 
new deletes in the FSDirectory"

When do you expose new deletes in the RAMDir, when do you expose new deletes in 
the FSDirectory, how do you manage slow merges from the RAMDir to the 
FSDirectory, how do you manage new adds to the RAMDir that take place during 
slow merges..."

Queue deletes to the RAMDir while copying the RAMDir to the FSDir in the 
background, perform the deletes after the copy is completed, then 
instantiate a new reader over the newly merged FSDirectory and a new RAMDir. 
Writes that occur during this process would go to another new RAMDir.
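
A minimal sketch of that sequence (the class and method names are invented, 
and synchronization/error handling are elided):

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Sketch: deletes arriving during the slow merge are queued and applied
// once the copy finishes; new adds go to a fresh RAMDirectory.
class RamToDiskMerge {
  private final List<Term> queuedDeletes = new ArrayList<Term>();
  private RAMDirectory activeRam = new RAMDirectory();

  IndexReader mergeToDisk(IndexWriter diskWriter, RAMDirectory frozenRam)
      throws IOException {
    // 1. New writes during the merge target a fresh RAMDir.
    activeRam = new RAMDirectory();
    // 2. Copy the frozen RAM segments to disk (the slow part).
    diskWriter.addIndexesNoOptimize(new Directory[] { frozenRam });
    // 3. Apply the deletes that were queued while the copy ran.
    for (Term t : queuedDeletes) {
      diskWriter.deleteDocuments(t);
    }
    queuedDeletes.clear();
    diskWriter.commit();
    // 4. Open a new reader over the merged FSDirectory; a full realtime
    //    reader would also layer the new RAMDir on top.
    return IndexReader.open(diskWriter.getDirectory());
  }
}
{code}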

One way to think of the realtime problem is in terms of segments rather than 
FSDirs and RAMDirs.  Some segments are on disk, some in RAM.  Each 
transaction is an instance of some segments and their deletes (we're not 
worried about whether the deletes have been flushed, so assume they exist as 
BitVectors).  The system should expose an API to checkpoint/flush at a given 
transaction level (usually the current one) and should not stop new updates 
from happening.
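
In that view a transaction is essentially an immutable snapshot: the set of 
segment readers it sees plus a deletes BitVector per segment (hypothetical 
types, for illustration only):

{code}
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BitVector;

// Sketch: a transaction as an immutable snapshot of segments and their
// deletes.  Checkpointing flushes exactly these segments and bit vectors
// without blocking newer snapshots from being created.
class TransactionSnapshot {
  final long id;                    // monotonically increasing transaction id
  final List<IndexReader> segments; // mix of on-disk and in-RAM segments
  final List<BitVector> deletes;    // one vector per segment, copy-on-write

  TransactionSnapshot(long id, List<IndexReader> segments,
                      List<BitVector> deletes) {
    this.id = id;
    this.segments = segments;
    this.deletes = deletes;
  }
}
{code}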

When I wrote this type of system, I managed individual segments outside of 
IndexWriter's merge policy and performed the merging manually by placing 
each segment in its own FSDirectory (the segment size was 64MB, which kept 
the number of directories down).  I do not know the best approach for this 
when performed within IndexWriter.

M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS 
when
it wants to actually perform the actual disk i/o, how expensive is flushing? Can
we make it cheap enough to meet Jason's absolute change rate requirements?"

When I tried out the transaction log, a write usually mapped fairly quickly 
to a hard disk write.  I don't think it's safe to leave writes up to the OS.
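
Making a log write durable instead of leaving it to the OS comes down to 
forcing the channel after each append, e.g. with standard java.nio (a 
sketch, not tied to any existing Lucene class):

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: a durable append.  force(false) flushes file content (not
// metadata) to the device instead of leaving it in the OS page cache.
class DurableAppend {
  void append(File log, byte[] record) throws IOException {
    FileOutputStream fos = new FileOutputStream(log, true);
    try {
      FileChannel ch = fos.getChannel();
      ch.write(ByteBuffer.wrap(record));
      ch.force(false); // fsync before acknowledging the write
    } finally {
      fos.close();
    }
  }
}
{code}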

M.M.: "maintain & updated deleted docs even though IndexWriter has the write 
lock"

In my previous realtime search implementation I got around this by having 
each segment in its own directory.  Assuming this is non-optimal, we will 
need to expose an IndexReader that holds the write lock of the IndexWriter.


> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.
