[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662033#action_12662033 ]

Jason Rutherglen commented on LUCENE-1476:
------------------------------------------

Marvin: "The whole tombstone idea arose out of the need for (close to) realtime 
search! It's intended to improve write speed."

It does improve write speed.  While making realtime search writes fast enough, 
I found that writing individual files per segment can become too costly: the 
files accumulate quickly, appending to a single file is faster than creating 
new files, and deleting the files becomes expensive.  For example, writing 
small individual files per commit generates many files when the number of 
segments is large and a delete spans multiple segments.  How costly this is 
varies with how often updates are expected to occur.  I modeled this after an 
extreme case: the update frequency of a MySQL instance backing data for a web 
application.

The MySQL design, translated to Lucene, is a transaction log per index, where 
updates consisting of documents and deletes are written to the transaction log 
file.  If Lucene crashed for some reason, the transaction log would be 
replayed.  The in-memory indexes and newly deleted document bitvectors would 
be held in RAM (LUCENE-1314) until flushed, either manually or based on memory 
usage.  Many users may not want a transaction log because they store the 
updates in a separate SQL database instance (this is the case where I work), 
which makes a transaction log redundant; it should therefore be optional.  The 
first implementation of this will not have a transaction log.
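A minimal sketch of the transaction-log idea described above: appends and deletes go to a single append-only file per index, which can be replayed after a crash.  All names here (TransactionLog, Op) are illustrative, not Lucene APIs, and the text-based record format is an assumption for clarity.

```java
import java.io.*;
import java.util.*;

// Hypothetical per-index transaction log: document adds and deletes are
// appended to one file, so a crash can be recovered by replaying the log.
public class TransactionLog {
    enum Type { ADD, DELETE }

    static final class Op {
        final Type type;
        final String payload; // serialized document, or the deleted doc's id
        Op(Type type, String payload) { this.type = type; this.payload = payload; }
    }

    private final File file;

    TransactionLog(File file) { this.file = file; }

    // Appending to one log file avoids creating many small per-segment files.
    synchronized void append(Op op) throws IOException {
        try (FileWriter w = new FileWriter(file, true)) {
            w.write(op.type + "\t" + op.payload + "\n");
        }
    }

    // On restart, replay every logged operation that was not yet flushed.
    List<Op> replay() throws IOException {
        List<Op> ops = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                ops.add(new Op(Type.valueOf(parts[0]), parts[1]));
            }
        }
        return ops;
    }
}
```

Making the log optional, as suggested, would mean gating `append` behind a configuration flag while keeping the in-memory index path unchanged.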

Marvin: "I don't think I understand. Is this the "combination index 
reader/writer" model, where the writer prepares a data structure that then gets 
handed off to the reader?"

It would be exposed as a combination reader/writer that manages the 
transaction status of each update.  The internal architecture is such that 
after each update, a new reader representing the new documents and deletes for 
the transaction is generated and pushed onto a stack.  The reader stack is 
drained based on whether a reader is too old to be useful anymore (i.e. no 
references to it remain, or it has N readers ahead of it).  

> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
