John:
  I have implemented batch (delete, insert, update) operations for the
Oracle Lucene Domain Index using OJVMDirectory; see the patch:
http://issues.apache.org/jira/browse/LUCENE-724
  The strategy used in this solution is to enqueue all operations on
a table that has a column indexed by Lucene, and then synchronize
all the operations at once.
  For each operation on the index (Lucene Domain Context), do:
   on insert:
      enqueue the operation storing the rowid of the row inserted.
   on update:
      enqueue the operation storing the rowid of the row updated.
   on delete:
      open an IndexReader, delete the document using
deleteDocuments(new Term("rowid", rid));
      enqueue the operation storing the rowid of the row deleted.

    A delete must be applied to the index at the time of the deletion,
because otherwise a following query using the lcontains() operator
could return a rowid that no longer exists.
    The sync process performs these steps:
    for each operation enqueued on the lucene_queue do {
       if operation is delete {
         remove the rowid from the lists scheduledDeleteForUpdate and
scheduledAdd
       } else {
          if operation is update {
             add the rowid to the list scheduledDeleteForUpdate
          }
          add the rowid to the list scheduledAdd
       }
    } // end do
    Open an IndexReader and delete all the rows on the list
scheduledDeleteForUpdate
    Open an IndexWriter and add all the rows on the list scheduledAdd
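
    The sync loop above could be sketched in plain Java roughly as
follows. This is an illustration only, not the code from the LUCENE-724
patch: the class SyncQueueSketch and the types Op, Kind and Plan are
hypothetical names, the queue is a plain in-memory list rather than
lucene_queue, and the two "lists" are modelled as ordered sets so that a
rowid enqueued twice is kept only once.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the queue-coalescing step; not the patch's code.
class SyncQueueSketch {
    enum Kind { INSERT, UPDATE, DELETE }

    /** One enqueued operation: what happened and to which row. */
    static final class Op {
        final Kind kind;
        final String rowid;
        Op(Kind kind, String rowid) {
            this.kind = kind;
            this.rowid = rowid;
        }
    }

    /** Result of the loop: what to delete first and what to add afterwards. */
    static final class Plan {
        // Assumption: ordered sets instead of lists, to avoid duplicate rowids.
        final Set<String> scheduledDeleteForUpdate = new LinkedHashSet<>();
        final Set<String> scheduledAdd = new LinkedHashSet<>();
    }

    /** Walks the queue in order and builds the two work lists. */
    static Plan coalesce(List<Op> queue) {
        Plan plan = new Plan();
        for (Op op : queue) {
            if (op.kind == Kind.DELETE) {
                // The row is gone: cancel any pending work for it.
                plan.scheduledDeleteForUpdate.remove(op.rowid);
                plan.scheduledAdd.remove(op.rowid);
            } else {
                if (op.kind == Kind.UPDATE) {
                    // The old version must be removed before re-adding.
                    plan.scheduledDeleteForUpdate.add(op.rowid);
                }
                plan.scheduledAdd.add(op.rowid);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Op> queue = new ArrayList<>();
        queue.add(new Op(Kind.INSERT, "a1"));
        queue.add(new Op(Kind.UPDATE, "a1"));
        queue.add(new Op(Kind.DELETE, "a1"));
        queue.add(new Op(Kind.UPDATE, "b2"));
        Plan plan = coalesce(queue);
        System.out.println("delete first: " + plan.scheduledDeleteForUpdate);
        System.out.println("add after:    " + plan.scheduledAdd);
    }
}
```

    In the real sync, the first collection would drive the IndexReader
deletes and the second the IndexWriter adds; here only the bookkeeping
is shown. A row that is inserted, updated and then deleted before the
sync ends up in neither collection.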

    Delete operations are enqueued too, so that rows inserted or
updated before the delete operation, in the interval between two
checkpoints, are eliminated from the pending work. For example:
      insert a1;
      update a1;
      delete a1;
      sync(it1);
      With the above sequence, the lists scheduledDeleteForUpdate and
scheduledAdd each hold an instance of the rowid a1, but when the loop
processes the delete message it removes the rowid from both lists.
      Best regards, Marcelo.

PS: The OJVMDirectory and OJVMFile classes operate on Oracle BLOBs
instead of files, so the sync operations can work on the LOBs without
locking problems and without copying the content to alternate
directories, because the concurrency system of the database
automatically maintains a copy of the LOBs in the undo area until the
commit is performed. When a commit is performed, OJVMDirectory
automatically removes all cached IndexReaders opened by the application.

On 1/5/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
John,

Batch deletion followed by batch addition is the best practise.  Ning Li made 
some changes that improve the performance of non-batch mass delete/add 
operations, but I'm not sure what the state of those changes is (whether they 
are still in Lucene's JIRA, or whether they are in CVS).

Otis

----- Original Message ----
From: John Song <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, January 5, 2007 12:53:12 AM
Subject: efficient ways of updating document

It seems to me that updating a document is rather tedious and slow in Lucene,
especially when updating a large number of documents.  Before opening an
IndexWriter to add documents, one has to open an IndexReader/IndexSearcher to
search for the document with a particular id.  Upon finding its docnum, delete
it.  One then closes the reader and opens the writer to add the updated
content.  Move on to the next item and repeat the process.  For a large number
of updates, the performance is bad.

One way to speed it up is not to do this process for every single update.  If
one has a large number of updates, open the reader/searcher first.  But instead
of searching for and deleting one document and then closing it, search for and
delete all of them and then close the reader/searcher.  Then one can open a
writer and just do a batch add.
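
To make the difference concrete, here is a minimal sketch that compares the
two orderings. It does not use the Lucene API: FakeIndex is a hypothetical
in-memory stand-in that only counts how many times a reader or a writer would
be opened, and BatchUpdateSketch and its method names are likewise made up for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cost model; it stands in for the index, not for Lucene itself.
class BatchUpdateSketch {
    /** In-memory stand-in for an index that counts reader/writer opens. */
    static final class FakeIndex {
        final Map<String, String> docs = new LinkedHashMap<>();
        int readerOpens = 0;
        int writerOpens = 0;
    }

    /** Per-document update: one reader and one writer session per item. */
    static void updateOneByOne(FakeIndex index, Map<String, String> updates) {
        for (Map.Entry<String, String> e : updates.entrySet()) {
            index.readerOpens++;                       // open reader
            index.docs.remove(e.getKey());             // delete the old document
            index.writerOpens++;                       // close reader, open writer
            index.docs.put(e.getKey(), e.getValue());  // add the new content
        }
    }

    /** Batched update: delete everything first, then add everything. */
    static void updateInBatch(FakeIndex index, Map<String, String> updates) {
        index.readerOpens++;                           // one reader for all deletes
        for (String id : updates.keySet()) {
            index.docs.remove(id);
        }
        index.writerOpens++;                           // one writer for all adds
        for (Map.Entry<String, String> e : updates.entrySet()) {
            index.docs.put(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> updates = new LinkedHashMap<>();
        updates.put("1", "first");
        updates.put("2", "second");
        updates.put("3", "third");

        FakeIndex oneByOne = new FakeIndex();
        updateOneByOne(oneByOne, updates);
        FakeIndex batched = new FakeIndex();
        updateInBatch(batched, updates);

        System.out.println("one by one: " + oneByOne.readerOpens + " reader opens, "
                + oneByOne.writerOpens + " writer opens");
        System.out.println("batched:    " + batched.readerOpens + " reader opens, "
                + batched.writerOpens + " writer opens");
    }
}
```

Both orderings leave the same documents in the index; the batched one opens a
single reader and a single writer no matter how many updates there are, while
the per-document one opens one of each per update.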

However, I wonder whether we could change the writer to do updates
automatically.  An update would consist of marking the old document invalid,
without removing it, and adding the new document.  We can clean things up later
on.  Using this method, there is no need to switch between reader and writer
during an update.


john




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]









--
Marcelo F. Ochoa
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/
