Giulio Cesare Solaroli wrote:
Dear developers,

is there any architectural reason why an IndexWriter could not
delete a document?

There are such reasons. Maybe Doug can give additional insight. Here is what I think:

One reason I see is that there is no such thing as a unique
document id in Lucene. The IndexReader is the object through which an
index is accessed and search is also done through a reader. The document
ids used by one IndexReader/IndexSearcher instance are unique/valid only
with regard to this instance, and the reader/searcher has no way of
changing document ids. However, calling optimize
on an index with deletions changes document ids: some documents
will have different ids after the optimize than before. This has
no effect on an existing reader instance, only on IndexReader instances
generated after the optimize.
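
To illustrate the renumbering, here is a small in-memory sketch. It is not Lucene code: ToyIndex and its methods are hypothetical names, and the only assumption is that a document id is a position, that a deletion merely marks a slot, and that optimize compacts the marked slots away.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT Lucene code: ToyIndex is a hypothetical class in which
// a document id is simply the document's position in the list.
class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    // Appends a document and returns its id (its position).
    int add(String doc) {
        docs.add(doc);
        deleted.add(false);
        return docs.size() - 1;
    }

    // Only marks the slot as deleted; all ids stay as they are.
    void delete(int id) {
        deleted.set(id, true);
    }

    // Returns the id of a live document, or -1 if absent or deleted.
    int idOf(String doc) {
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i).equals(doc)) {
                return i;
            }
        }
        return -1;
    }

    // Builds a compacted copy without the deleted slots. Every document
    // behind a deleted slot gets a smaller id in the copy, while this
    // instance (the "old reader") is left untouched.
    ToyIndex optimize() {
        ToyIndex compact = new ToyIndex();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                compact.add(docs.get(i));
            }
        }
        return compact;
    }
}

class DocIdDemo {
    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a");
        index.add("b");
        index.add("c");
        index.delete(1);                                 // delete "b"
        System.out.println(index.idOf("c"));             // id in the old view
        System.out.println(index.optimize().idOf("c"));  // id after compaction
    }
}
```

In this model "c" keeps its id in the old view and gets a new one in the compacted copy, which is why an id obtained from one reader should not be handed to another.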

Of course an application can take care of unique document ids and store
them in a dedicated field. The ids could e.g. be URLs. If this
unique id is used for specifying a document for deletion or other terms
are used for specifying the document(s) for deletion, index access as
provided by a reader is needed to do the deletion. IndexWriter currently
does not have these capabilities.

So the only solution to the update problem is to build a wrapper around
Lucene that handles reading, writing, and updating. And this is what you
are actually doing :-)

I understand that the IndexReader (besides its strange naming for this
feature) is the right class to use to delete a document, but this
raises a huge problem for me.

We add almost 50.000 documents a day, while deleting a similar amount
of old documents over the same period.
We index new documents in batch every 5 minutes while deleting the old
ones and optimize the index twice a day, in order to keep good
performance for the queries and the number of index files under
control.

In this situation, I try to keep the same IndexWriter open as much as
possible, in order to avoid any unnecessary fragmentation of the
index.
Before indexing any document, I can check to see if the document has
already been inserted, but I am not able to delete it without closing
the IndexWriter, opening an IndexReader, deleting the document,
closing the IndexReader and opening the IndexWriter again.

This arrangement seems reasonable if updated documents are scarce, but
doesn't seem feasible with a high rate of updated documents.

I would prefer to avoid deleting all updated documents from the index
before opening the IndexWriter because the updating and indexing
procedure would get much more complex, and because I would introduce a
significant time gap during which a previously available document is no
longer available in the index.

If you want to do several updates at the same time, the most efficient way would be to:

1) Keep an IndexReader/Searcher open on your index in order to guarantee
read access and a consistent index during the whole process.

2) Open a new IndexReader and delete all the documents that you want to
update.

3) Close the IndexReader (makes the deletions visible for any new
readers/writers but not for the still-open Searcher/Reader).

4) Open an IndexWriter and add all modified documents.

5) Close the IndexWriter (makes the insertions visible for any new
readers/writers but not for the still-open Searcher/Reader).

6) Substitute the IndexReader/Searcher with a new one to make
the changes visible.
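
The six steps can be sketched with a small in-memory model. This is not Lucene code: ToyStore, openReader, deleteAll and addAll are hypothetical names, and the sketch assumes only what is described above, namely that an open reader keeps seeing the state committed when it was opened and that closing a reader or writer commits its changes for readers opened later.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model, NOT Lucene code: every name here is hypothetical.
class ToyStore {
    // Latest committed state, keyed by an application-level unique id
    // (for example a URL stored in a dedicated field).
    private Map<String, String> committed = Collections.emptyMap();

    // Opening a reader captures the state committed at that moment.
    // Later commits replace the map instead of mutating it, so the
    // snapshot stays consistent.
    Map<String, String> openReader() {
        return committed;
    }

    // Steps 2 and 3: delete the whole batch, then commit ("close").
    void deleteAll(Iterable<String> ids) {
        Map<String, String> next = new HashMap<>(committed);
        for (String id : ids) {
            next.remove(id);
        }
        committed = Collections.unmodifiableMap(next);
    }

    // Steps 4 and 5: add the new versions, then commit ("close").
    void addAll(Map<String, String> docs) {
        Map<String, String> next = new HashMap<>(committed);
        next.putAll(docs);
        committed = Collections.unmodifiableMap(next);
    }
}

class BatchUpdateDemo {
    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        Map<String, String> initial = new HashMap<>();
        initial.put("url1", "old text");
        store.addAll(initial);

        Map<String, String> searcher = store.openReader(); // step 1

        Map<String, String> batch = new HashMap<>();
        batch.put("url1", "new text");
        store.deleteAll(batch.keySet());                   // steps 2 and 3
        store.addAll(batch);                               // steps 4 and 5

        System.out.println(searcher.get("url1"));          // still the old version
        searcher = store.openReader();                     // step 6
        System.out.println(searcher.get("url1"));          // now the new version
    }
}
```

The point of the arrangement is that the document never disappears from the searcher that is serving queries: it sees the old version until the swap in step 6 and the new version afterwards.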

Do you confirm my idea that keeping an IndexWriter open as much as
possible while indexing batches of documents is a "good thing"?

Yes. IndexWriter uses a RAMDirectory as a cache. If you close it after each document and open a new one, you force unnecessary write operations to your hard disk.
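
A rough way to see the cost is to count flushes in a toy model. This is not Lucene code: BufferedToyWriter and the buffer capacity of 10 are assumptions made for illustration; the sketch only assumes that a writer buffers documents in memory and writes to disk when the buffer fills up or when it is closed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT Lucene code: BufferedToyWriter is hypothetical.
class BufferedToyWriter {
    private final int capacity;                    // documents held in memory
    private final List<String> buffer = new ArrayList<>();
    private int diskWrites = 0;                    // simulated disk flushes

    BufferedToyWriter(int capacity) {
        this.capacity = capacity;
    }

    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    // Closing flushes whatever is still buffered and reports the total.
    int close() {
        flush();
        return diskWrites;
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        diskWrites++;
        buffer.clear();
    }
}

class WriterReuseDemo {
    // One writer kept open for the whole batch.
    static int writesWithOneWriter(int docs, int capacity) {
        BufferedToyWriter writer = new BufferedToyWriter(capacity);
        for (int i = 0; i < docs; i++) {
            writer.addDocument("doc" + i);
        }
        return writer.close();
    }

    // A fresh writer opened and closed for every single document.
    static int writesWithWriterPerDocument(int docs, int capacity) {
        int total = 0;
        for (int i = 0; i < docs; i++) {
            BufferedToyWriter writer = new BufferedToyWriter(capacity);
            writer.addDocument("doc" + i);
            total += writer.close();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(writesWithOneWriter(25, 10));         // 3
        System.out.println(writesWithWriterPerDocument(25, 10)); // 25
    }
}
```

With 25 documents and a buffer of 10, the single writer flushes three times, while the writer-per-document variant flushes once per document.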

Is there any chance of ever seeing a deleteDocument method in the
IndexWriter class?

Probably not. I guess you either have to update every document separately as described in your email (open and close a reader and writer for each document), or do it in the way I describe above (more efficient).

Christoph







