writer.docCount() adds up the docCount values from segmentInfos. The problem
is that these values are currently not updated when documents are deleted, and
that the values for new segments created during a merge are taken from the old
segmentInfos. My patch makes writer.docCount() deliver the same results as
reader.maxDoc(), which reflects the deletion of documents in a segment only
once that segment has been merged. This is the difference from
reader.numDocs(), which is updated immediately.
Look at what is tested after deleting 50 documents:

          writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
          assertEquals(100, writer.docCount()); <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
          writer.close();

          reader = IndexReader.open(dir);
          assertEquals(100, reader.maxDoc());    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
          assertEquals(50, reader.numDocs());
          reader.close();

          writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
          writer.optimize();
          assertEquals(50, writer.docCount());
          writer.close();

          reader = IndexReader.open(dir);
          assertEquals(50, reader.maxDoc());
          assertEquals(50, reader.numDocs());
          reader.close();
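
The relationship between the three counters can be illustrated with a small
toy model. This is only a sketch: DocCountSketch and its methods are
hypothetical names, not Lucene code. They merely mimic the behaviour described
above, where a deletion only marks a slot and a merge (here optimize())
reclaims it.

```java
import java.util.BitSet;

/** Toy model of a single index segment. NOT Lucene code; all names are hypothetical. */
public class DocCountSketch {
    private int maxDoc;                          // document slots, including deleted ones
    private final BitSet deleted = new BitSet(); // slots marked as deleted

    public DocCountSketch(int docs) {
        this.maxDoc = docs;
    }

    /** Marks a slot as deleted; the slot itself stays in the segment. */
    public void delete(int docId) {
        deleted.set(docId);
    }

    /** Unchanged by delete(); shrinks only when the segment is merged. */
    public int maxDoc() {
        return maxDoc;
    }

    /** Reflects deletions immediately. */
    public int numDocs() {
        return maxDoc - deleted.cardinality();
    }

    /** Models optimize(): deleted slots are dropped for good. */
    public void optimize() {
        maxDoc = numDocs();
        deleted.clear();
    }

    public static void main(String[] args) {
        DocCountSketch segment = new DocCountSketch(100);
        for (int i = 0; i < 50; i++) {
            segment.delete(i);
        }
        System.out.println(segment.maxDoc() + " " + segment.numDocs()); // prints "100 50"
        segment.optimize();
        System.out.println(segment.maxDoc() + " " + segment.numDocs()); // prints "50 50"
    }
}
```

With the patch, writer.docCount() is meant to follow maxDoc() in this model:
100 before optimize() and 50 afterwards.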


Maybe I did not think far enough. writer.docCount() could deliver the same values as reader.numDocs(). However, this would require changes in IndexReader: reader.doClose() would have to update the values in segmentInfos, whereas currently IndexReader only reads segmentInfos. This also makes sense with respect to merging if we keep in mind that the docCount values in segmentInfos are used to control the merge process. I will think about it, and maybe I will submit another patch soon. Please wait a little before committing my IndexWriter patch. It may become obsolete.

Christoph


Otis Gospodnetic wrote:
Christoph,

Thank you for expanding the coverage of the test.
However, this looks wrong to me:

-          assertEquals(50, writer.docCount());
+          assertEquals(100, writer.docCount());

Aren't you trying to fix IndexWriter so that after adding 100 and
deleting 50 documents, its docCount() method returns 50?
The above suggests that the correct behaviour is to return 100, even
though 50 have been deleted, and only 50 documents are left in the
index.

Could you please clarify this for me, before I commit the patches to
(Test)IndexWriter?

Thanks,
Otis


--- Christoph Goller <[EMAIL PROTECTED]> wrote:


Sorry, here is the patch.

Otis Gospodnetic wrote:

Christoph,

The idea looks good, but the test fails for both pre-patched as well as
patched version of IndexWriter.
I converted your test to JUnit test and will check it into CVS shortly.
If I made a mistake in it, please point it out.
You can run 'ant test-unit' to see where the test fails.

Otis

--- Christoph Goller <[EMAIL PROTECTED]> wrote:


IndexWriter implements the method docCount() which reads the number
of documents from the SegmentInfos of the index. However, it delivers
incorrect values if documents get deleted from the index. The reason for
this is that SegmentInfo.docCounts are updated in an incorrect way when
segments get merged. The new value is taken from the old SegmentInfos.
It would be better to take the value from the reader instead. In this
way indexWriter.docCount() would deliver the same value as
indexReader.maxDoc().

test and patch are attached,
Christoph


--
*****************************************************************
* Dr. Christoph Goller       Tel.:   +49 89 203 45734           *
* Detego Software GmbH       Mobile: +49 179 1128469            *
* Keuslinstr. 13             Fax.:   +49 721 151516176          *
* 80798 München, Germany     Email:  [EMAIL PROTECTED]          *
*****************************************************************


Index: IndexWriter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
retrieving revision 1.14
diff -u -r1.14 IndexWriter.java
--- IndexWriter.java    12 Aug 2003 15:05:03 -0000      1.14
+++ IndexWriter.java    3 Sep 2003 14:55:33 -0000
@@ -355,7 +355,7 @@
       if ((reader.directory == this.directory) || // if we own the directory
           (reader.directory == this.ramDirectory))
         segmentsToDelete.addElement(reader);      // queue segment for deletion
-      mergedDocCount += si.docCount;
+      mergedDocCount += reader.numDocs();
     }
     if (infoStream != null) {
       infoStream.println();
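
The effect of the one-line change above can be illustrated without Lucene. In
the sketch below, Segment, mergedCountFromInfos and mergedCountFromReaders are
hypothetical stand-ins, not Lucene classes. The assumption is that a merge
drops deleted documents, so the merged segment's maxDoc() equals the sum of the
source readers' numDocs().

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the merge bookkeeping touched by the patch. NOT Lucene code. */
public class MergeCountSketch {

    /** Hypothetical stand-in for a segment: its recorded docCount plus later deletions. */
    public static final class Segment {
        final int docCount;     // value recorded in the segment info when it was written
        final int deletedDocs;  // documents deleted since then

        public Segment(int docCount, int deletedDocs) {
            this.docCount = docCount;
            this.deletedDocs = deletedDocs;
        }

        public int numDocs() {
            return docCount - deletedDocs;
        }
    }

    /** Pre-patch behaviour: sums the recorded counts, so deleted documents are still counted. */
    public static int mergedCountFromInfos(List<Segment> segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.docCount;
        }
        return sum;
    }

    /** Patched behaviour: sums the live documents, which is what the merged segment holds. */
    public static int mergedCountFromReaders(List<Segment> segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.numDocs();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(new Segment(60, 30), new Segment(40, 20));
        System.out.println(mergedCountFromInfos(segments));   // prints 100
        System.out.println(mergedCountFromReaders(segments)); // prints 50
    }
}
```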


import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/*
 * Created on 03.09.2003
 *
 * To change the template for this generated file go to
 * Window>Preferences>Java>Code Generation>Code and Comments
 */

/**
 * @author goller
 */
public class IndexWriterDocCountTest {
  int docCount = 0;

  void addDoc(IndexWriter writer)
  {
    Document doc = new Document();
    doc.add(Field.Keyword("id","id" + docCount));
    doc.add(Field.UnStored("content","aaa"));
    try {
      writer.addDocument(doc);
    }
    catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
    docCount++;
  }

  public static void main(String[] args) {
    Directory dir = new RAMDirectory();
    IndexWriterDocCountTest test = new IndexWriterDocCountTest();

    IndexWriter writer = null;
    IndexReader reader = null;
    int i;
    try {
      writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
      for (i = 0; i < 100; i++)
        test.addDoc(writer);
      System.out.println("docCount: " + writer.docCount());
      writer.close();
      reader = IndexReader.open(dir);
      for (i = 0; i < 50; i++)
        reader.delete(i);
      reader.close();
      System.out.println("doc #0-49 deleted");
      writer = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
      System.out.println("docCount: " + writer.docCount());
      writer.optimize();
      System.out.println("optimized called");
      System.out.println("docCount: " + writer.docCount());
      writer.close();
    }
    catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }
}




---------------------------------------------------------------------

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]







Index: TestIndexWriter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/index/TestIndexWriter.java,v
retrieving revision 1.1
diff -u -r1.1 TestIndexWriter.java
--- TestIndexWriter.java        10 Sep 2003 12:58:37 -0000      1.1
+++ TestIndexWriter.java        10 Sep 2003 16:29:31 -0000
@@ -47,10 +47,23 @@
           reader.close();
 
           writer = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
-          assertEquals(50, writer.docCount());
+          assertEquals(100, writer.docCount());
+          writer.close();
+
+          reader = IndexReader.open(dir);
+          assertEquals(100, reader.maxDoc());
+          assertEquals(50, reader.numDocs());
+          reader.close();
+
+          writer = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
           writer.optimize();
           assertEquals(50, writer.docCount());
           writer.close();
+
+          reader = IndexReader.open(dir);
+          assertEquals(50, reader.maxDoc());
+          assertEquals(50, reader.numDocs());
+          reader.close();
       }
       catch (IOException e) {
           e.printStackTrace();


