Doug
Christoph Goller wrote:
I thought things over and I now think there are two possible options for coping with the indexWriter.docCount() bug. I cannot decide this alone. Maybe voting is needed.
Problem:
writer.docCount() adds up the docCount values from segmentInfos. Note that currently only IndexWriter writes segmentInfos ("segments" file). IndexReader only reads them. The problem is that segmentInfo.docCount values are updated incorrectly in indexWriter.mergeSegments. Information about deleted documents is ignored, and therefore segmentInfo.docCount values for new segments become too big and do not reflect the real size of the new segments. This has two effects. Firstly, writer.docCount() becomes incorrect. Secondly, the merge process is controlled by incorrect values for segment size, since the docCount values from segmentInfos are what control the merge process.
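To make the problem concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class DocCountBugSketch, its Segment type and both method names are hypothetical and exist only for illustration. It assumes the recorded count of a merged segment is the plain sum of the source segments' sizes, while the merged segment really holds only the documents that were not deleted.

```java
public class DocCountBugSketch {

    // Hypothetical stand-in for one segment: maxDoc counts every document
    // slot, deleted counts the slots that are marked as deleted.
    static final class Segment {
        final int maxDoc;
        final int deleted;

        Segment(int maxDoc, int deleted) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
        }
    }

    // Assumed buggy bookkeeping: the merged segment's recorded docCount is
    // the sum of the source sizes, with deletions ignored.
    static int recordedDocCountIgnoringDeletes(Segment[] segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.maxDoc;
        }
        return sum;
    }

    // What the merged segment really contains: deleted documents are dropped.
    static int actualMergedSize(Segment[] segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.maxDoc - s.deleted;
        }
        return sum;
    }

    public static void main(String[] args) {
        Segment[] segments = { new Segment(10, 5), new Segment(10, 5) };
        System.out.println("recorded docCount: " + recordedDocCountIgnoringDeletes(segments));
        System.out.println("actual size:       " + actualMergedSize(segments));
    }
}
```

With two segments of 10 documents each and 5 deletions in each, the recorded value is 20 while the merged segment holds 10 documents. In this model, that gap is what distorts both writer.docCount() and the merge decisions.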
Option (A)
This is the IndexWriter patch that I submitted. This patch has the effect that segmentInfo.docCount values represent the real size of the segments. Even if a document is deleted, it is still there until the segment gets merged. For every segment, the corresponding segmentInfo.docCount value delivers the same value that a reader on this segment would deliver with reader.maxDoc(). Of course this also means that for readers and writers on the whole index reader.maxDoc() == writer.docCount().
Option (B)
This option leaves IndexWriter as it was. IndexReader has to be changed. Instead of only reading segmentInfos ("segments" file), IndexReader would have to write segmentInfos if documents have been deleted. I would do that in reader.doClose. The effect would be that for every segment segmentInfo.docCount would deliver the same value that a reader on this segment would deliver with reader.numDocs(). For readers and writers on the whole index we would have reader.numDocs() == writer.docCount(). Here segmentInfo.docCount values represent the number of valid documents of a segment, that is, those documents that have not been deleted.
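The difference between the two options can be sketched with another toy model in plain Java. Again, this is not Lucene code: DocCountOptionsSketch, its Segment type and all method names are hypothetical. It assumes a single segment to which 100 documents were added and in which 50 were later deleted, and it assumes that a merge writes one new segment containing only the documents that were not deleted.

```java
public class DocCountOptionsSketch {

    // Hypothetical segment model: maxDoc includes deleted documents,
    // numDocs() counts only the documents that are still valid.
    static final class Segment {
        final int maxDoc;
        final int deleted;

        Segment(int maxDoc, int deleted) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
        }

        int numDocs() {
            return maxDoc - deleted;
        }
    }

    // Option (A): docCount mirrors maxDoc, the physical size of each segment.
    static int docCountOptionA(Segment[] segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.maxDoc;
        }
        return sum;
    }

    // Option (B): docCount mirrors numDocs, the valid documents of each segment.
    static int docCountOptionB(Segment[] segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.numDocs();
        }
        return sum;
    }

    // Assumed merge behaviour: deleted documents are dropped, so the new
    // segment has no deletions and its maxDoc equals the old valid count.
    static Segment merge(Segment[] segments) {
        return new Segment(docCountOptionB(segments), 0);
    }

    public static void main(String[] args) {
        Segment[] before = { new Segment(100, 50) };
        Segment[] after = { merge(before) };
        System.out.println("before merge: A=" + docCountOptionA(before) + " B=" + docCountOptionB(before));
        System.out.println("after merge:  A=" + docCountOptionA(after) + " B=" + docCountOptionB(after));
    }
}
```

In this model the two options disagree only while deleted documents are still physically present (100 versus 50). Once the segment has been merged, both report 50.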
I am slightly in favour of option (A) since it is less work to do :-) and it seems reasonable to use the real size of segments for controlling the merge process. However, I can also implement option (B).
Christoph
Otis Gospodnetic wrote:
Christoph,
Thank you for expanding the coverage of the test. However, this looks wrong to me:
- assertEquals(50, writer.docCount());
+ assertEquals(100, writer.docCount());
Aren't you trying to fix IndexWriter so that after adding 100 and deleting 50 documents, its docCount() method returns 50? The above suggests that the correct behaviour is to return 100, even though 50 have been deleted, and only 50 documents are left in the index.
Could you please clarify this for me, before I commit the patches to (Test)IndexWriter?
Thanks, Otis
--- Christoph Goller <[EMAIL PROTECTED]> wrote:
