Christoph,
Option (A) seems to be the better choice after all.
Isn't that the IndexWriter patch that you ... ah, yes, you say that
yourself below.
Thanks again, I'll commit the patched IndexWriter now.
Otis
--- Christoph Goller <[EMAIL PROTECTED]> wrote:
> I thought things over and I now think there are two possible options
> for coping with the indexWriter.docCount() bug. I cannot decide this
> alone. Maybe voting is needed.
>
> Problem:
>
> writer.docCount() adds up the docCount values from segmentInfos.
> Note that currently only IndexWriter writes segmentInfos (the
> "segments" file); IndexReader only reads them. The problem is that
> segmentInfo.docCount values are updated incorrectly in
> indexWriter.mergeSegments. Information about deleted documents is
> ignored, and therefore segmentInfo.docCount values for new segments
> become too big and do not reflect the real size of the new segments.
> This has two effects. Firstly, writer.docCount() becomes incorrect;
> secondly, the merge process is controlled by incorrect values for
> segment size, because the docCount values from segmentInfos are used
> to control the merge process.
>
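To make the miscount concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the `Segment` class, its fields, and both methods are hypothetical names invented for illustration. It assumes, as described above, that a merge drops deleted documents while the recorded count is simply carried over from the old segmentInfos.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the docCount bookkeeping; not Lucene code. */
final class MergeCountSketch {

    /** Hypothetical stand-in for one segment and its segmentInfo entry. */
    static final class Segment {
        final int recordedDocCount; // what the "segments" file claims
        final int deleted;          // documents marked deleted in this segment

        Segment(int recordedDocCount, int deleted) {
            this.recordedDocCount = recordedDocCount;
            this.deleted = deleted;
        }
    }

    /** The miscount: sums the old recorded counts and ignores deletions. */
    static int recordedCountIgnoringDeletes(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) {
            total += s.recordedDocCount;
        }
        return total;
    }

    /** Real size of the merged segment: deleted documents are gone after the merge. */
    static int realSizeAfterMerge(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) {
            total += s.recordedDocCount - s.deleted;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(new Segment(10, 3), new Segment(10, 0));
        System.out.println("recorded:  " + recordedCountIgnoringDeletes(segments));
        System.out.println("real size: " + realSizeAfterMerge(segments));
    }
}
```

With one segment of 10 documents (3 deleted) and one of 10 documents (none deleted), the recorded count comes out as 20 while the merged segment really holds 17 documents.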
> Option (A)
>
> This is the IndexWriter patch that I submitted. This patch has the
> effect that segmentInfo.docCount values represent the real size of the
> segments. Even if a document is deleted, it is still there until the
> segment gets merged. For every segment, the corresponding
> segmentInfo.docCount value delivers the same value that a reader on
> this segment would deliver with reader.maxDoc(). Of course this also
> means that for readers and writers on the whole index
> reader.maxDoc() == writer.docCount().
>
> Option (B)
>
> This option leaves IndexWriter as it was. IndexReader has to be
> changed. Instead of only reading segmentInfos ("segments" file),
> IndexReader would have to write segmentInfos if documents have been
> deleted. I would do that in reader.doClose. The effect would be that
> for every segment segmentInfo.docCount would deliver the same value
> that a reader on this segment would deliver with reader.numDocs(). For
> readers and writers on the whole index we would have
> reader.numDocs() == writer.docCount(). Here segmentInfo.docCount
> values represent the number of valid documents of a segment, that is,
> those documents that have not been deleted.
>
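The difference between the two options can be sketched with the scenario discussed in this thread: 100 documents added, 50 deleted, nothing merged away yet. This is a toy model in plain Java, not the Lucene API. `ToySegment`, `docCountOptionA`, and `docCountOptionB` are hypothetical names, and the split of the 100 documents into two segments is an arbitrary assumption; only the meanings of maxDoc and numDocs are taken from the description above.

```java
import java.util.Arrays;
import java.util.List;

/** Toy comparison of the two docCount semantics; not Lucene code. */
final class DocCountOptionsSketch {

    /** Hypothetical stand-in for a segment as seen by a reader. */
    static final class ToySegment {
        final int maxDoc;  // documents physically present, deleted ones included
        final int deleted; // documents marked deleted but not yet merged away

        ToySegment(int maxDoc, int deleted) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
        }

        /** Documents that have not been deleted. */
        int numDocs() {
            return maxDoc - deleted;
        }
    }

    /** Option (A): docCount mirrors maxDoc, the real size of each segment. */
    static int docCountOptionA(List<ToySegment> segments) {
        int total = 0;
        for (ToySegment s : segments) {
            total += s.maxDoc;
        }
        return total;
    }

    /** Option (B): docCount mirrors numDocs, the valid documents of each segment. */
    static int docCountOptionB(List<ToySegment> segments) {
        int total = 0;
        for (ToySegment s : segments) {
            total += s.numDocs();
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 documents in total, 50 of them marked deleted.
        List<ToySegment> index = Arrays.asList(new ToySegment(60, 50), new ToySegment(40, 0));
        System.out.println("option (A): " + docCountOptionA(index));
        System.out.println("option (B): " + docCountOptionB(index));
    }
}
```

Under option (A) the count for this index is 100, which is why the test asserts 100; under option (B) it would be 50.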
> I am slightly in favour of option (A) since it is less work to do :-)
> and it seems reasonable to use the real size of segments for
> controlling the merge process. However, I can also implement option
> (B).
>
> Christoph
>
> Otis Gospodnetic wrote:
> > Christoph,
> >
> > Thank you for expanding the coverage of the test.
> > However, this looks wrong to me:
> >
> > - assertEquals(50, writer.docCount());
> > + assertEquals(100, writer.docCount());
> >
> > Aren't you trying to fix IndexWriter so that after adding 100 and
> > deleting 50 documents, its docCount() method returns 50?
> > The above suggests that the correct behaviour is to return 100, even
> > though 50 have been deleted, and only 50 documents are left in the
> > index.
> >
> > Could you please clarify this for me, before I commit the patches to
> > (Test)IndexWriter?
> >
> > Thanks,
> > Otis
> >
> >
> > --- Christoph Goller <[EMAIL PROTECTED]> wrote:
> >
>
> --
> *****************************************************************
> * Dr. Christoph Goller Tel.: +49 89 203 45734 *
> * Detego Software GmbH Mobile: +49 179 1128469 *
> * Keuslinstr. 13 Fax.: +49 721 151516176 *
> * 80798 München, Germany Email: [EMAIL PROTECTED] *
> *****************************************************************
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>