Also, look at the IndexModifier class, which hides a number of the ugly
details. Under the covers, I think (although I haven't looked) it does
pretty much exactly what you outlined. At least, the recommendation that
you do your deletes and adds in batches leads me to think so.
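
To see why batching matters, here is a toy model in plain Java. It is not
Lucene code and it is not how IndexModifier is actually implemented; ToyIndex
and all of its methods are hypothetical. It assumes only that deletes need
"reader mode", adds need "writer mode", and that every change of mode costs
a close plus an open.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in for an index (hypothetical, not Lucene's API). */
final class ToyIndex {
    private final Map<String, String> docs = new LinkedHashMap<>();
    // Assumption: the index starts in reader (delete) mode.
    private boolean writerMode = false;
    int modeSwitches = 0;

    // Counts one switch each time the required mode differs from the current one.
    private void ensureMode(boolean wantWriter) {
        if (writerMode != wantWriter) {
            writerMode = wantWriter;
            modeSwitches++;
        }
    }

    void delete(String id) { ensureMode(false); docs.remove(id); }

    void add(String id, String body) { ensureMode(true); docs.put(id, body); }

    String get(String id) { return docs.get(id); }

    /** Delete and re-add one document at a time. */
    static void updateOneByOne(ToyIndex idx, Map<String, String> updates) {
        for (Map.Entry<String, String> e : updates.entrySet()) {
            idx.delete(e.getKey());
            idx.add(e.getKey(), e.getValue());
        }
    }

    /** Do all the deletes first, then all the adds. */
    static void updateBatched(ToyIndex idx, Map<String, String> updates) {
        for (String id : updates.keySet()) {
            idx.delete(id);
        }
        for (Map.Entry<String, String> e : updates.entrySet()) {
            idx.add(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> updates = new LinkedHashMap<>();
        updates.put("1", "new text 1");
        updates.put("2", "new text 2");
        updates.put("3", "new text 3");

        ToyIndex a = new ToyIndex();
        updateOneByOne(a, updates);
        ToyIndex b = new ToyIndex();
        updateBatched(b, updates);

        System.out.println("one-by-one switches: " + a.modeSwitches); // 5
        System.out.println("batched switches:    " + b.modeSwitches); // 1
    }
}
```

Both variants leave the same documents in the toy index; only the number of
mode switches differs (2N-1 versus 1 for N updates in this model).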

Best
Erick

On 1/5/07, Marcelo Ochoa <[EMAIL PROTECTED]> wrote:

John:
   I have implemented batch (delete, insert, update) operations for the
Oracle Lucene Domain Index using OJVMDirectory; see the patch:
http://issues.apache.org/jira/browse/LUCENE-724
  The strategy used in this solution is to enqueue all operations on
the table that has a column indexed by Lucene and then synchronize
all the operations.
   For each operation on the index (Lucene Domain Context), do:
    on insert:
       enqueue the operation storing the rowid of the row inserted.
    on update:
       enqueue the operation storing the rowid of the row updated.
    on delete:
       open an IndexReader, delete the document using
deleteDocuments(new Term("rowid", rid));
       enqueue the operation storing the rowid of the row deleted.

     Deletes must be applied to the index at the time the row is
deleted; otherwise a following query using the operator lcontains()
could return a rowid that no longer exists.
     The sync process performs these steps:
     for each operation enqueued on the lucene_queue do {
        if operation is delete {
          remove the rowid from the lists scheduledDeleteForUpdate and
scheduledAdd
        } else {
           if operation is update {
              add the rowid to the list scheduledDeleteForUpdate
           }
           add the rowid to the list scheduledAdd
        }
     } // end do
     Open an IndexReader and delete all the rows on the list
scheduledDeleteForUpdate
     Open an IndexWriter and add all the rows on the list scheduledAdd
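
     The queue replay above can be sketched in plain Java as follows. This
is only an illustration, not the code from the LUCENE-724 patch: the names
SyncPlanner, Op, Kind and Plan are hypothetical, and sets are used in place
of the lists (an assumption) so that a rowid updated twice is scheduled only
once. The actual IndexReader/IndexWriter calls are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the queue replay; not taken from the patch. */
final class SyncPlanner {

    enum Kind { INSERT, UPDATE, DELETE }

    /** One queue entry: what happened, and to which row. */
    static final class Op {
        final Kind kind;
        final String rowid;

        Op(Kind kind, String rowid) {
            this.kind = kind;
            this.rowid = rowid;
        }
    }

    /** Result of the replay: rowids to delete first, then rowids to add. */
    static final class Plan {
        final Set<String> scheduledDeleteForUpdate = new LinkedHashSet<>();
        final Set<String> scheduledAdd = new LinkedHashSet<>();
    }

    static Plan plan(List<Op> queue) {
        Plan p = new Plan();
        for (Op op : queue) {
            if (op.kind == Kind.DELETE) {
                // The document was already deleted from the index when the row
                // was deleted, so only the pending work for it is cancelled.
                p.scheduledDeleteForUpdate.remove(op.rowid);
                p.scheduledAdd.remove(op.rowid);
            } else {
                if (op.kind == Kind.UPDATE) {
                    // The old version must be deleted before the new one is added.
                    p.scheduledDeleteForUpdate.add(op.rowid);
                }
                p.scheduledAdd.add(op.rowid);
            }
        }
        return p;
    }

    public static void main(String[] args) {
        List<Op> queue = new ArrayList<>();
        queue.add(new Op(Kind.INSERT, "a1"));
        queue.add(new Op(Kind.UPDATE, "a1"));
        queue.add(new Op(Kind.UPDATE, "b2"));
        queue.add(new Op(Kind.DELETE, "a1"));

        Plan p = plan(queue);
        System.out.println("delete first: " + p.scheduledDeleteForUpdate); // [b2]
        System.out.println("then add:     " + p.scheduledAdd);             // [b2]
    }
}
```

     A caller would then delete every rowid in scheduledDeleteForUpdate with
one IndexReader and add every rowid in scheduledAdd with one IndexWriter.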

     Delete operations are enqueued too, so that rows inserted or
updated before the delete operation, during the time between two
checkpoints, are eliminated from the pending lists. For example:
       insert a1;
       update a1;
       delete a1;
       sync(it1);
       With the above sequence, the lists scheduledDeleteForUpdate and
scheduledAdd both hold the rowid a1, but when the loop processes the
delete message it removes the rowid from both lists.
       Best regards, Marcelo.

PS: The OJVMDirectory and OJVMFile classes operate on Oracle BLOBs instead
of files, so the sync operations can work on the LOBs without
locking problems and without copying the content to alternate
directories, because the concurrency system of the database
automatically maintains a copy of the LOBs in the undo area until the
commit is performed. When a commit is performed, OJVMDirectory
automatically removes all cached IndexReaders opened by the application.

On 1/5/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> John,
>
> Batch deletion followed by batch addition is the best practice.  Ning Li
> made some changes that improve the performance of non-batch mass delete/add
> operations, but I'm not sure what the state of those changes is (whether
> they are still in Lucene's JIRA, or whether they are in CVS).
>
> Otis
>
> ----- Original Message ----
> From: John Song <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, January 5, 2007 12:53:12 AM
> Subject: efficient ways of updating document
>
> It seems to me that updating a document is rather tedious and slow in
> Lucene, especially when updating a large number of documents.  Before opening
> an IndexWriter to add documents, one has to open an
> IndexReader/IndexSearcher to search for the document with a particular
> id.  Upon finding its docnum, delete it.  One then closes the reader and opens
> the writer to add the updated content.  Move on to the next item and repeat the
> process.  For a large number of updates, the performance is bad.
>
> One way to speed it up is not to do this process for every single
> update.  If one has a large number of updates, open the reader/searcher
> first.  But instead of searching for and deleting one document and then
> closing it, search for and delete all of them and then close the
> reader/searcher.  Then one can open a writer and just do a batch add.
>
> However, I wonder whether we could change the writer to do updates
> automatically.  An update would consist of marking the old document invalid,
> without removing it, and adding the new document.  We can clean things up
> later on.  With this method, there is no need to switch between reader
> and writer during an update.
>
>
> john
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


--
Marcelo F. Ochoa
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/
