[ https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463524 ]
Michael McCandless commented on LUCENE-140:
-------------------------------------------

OK from that indexing-failure.log (thanks Jed!) I can see that indeed there are segments whose maxDoc() is much smaller than deleteDocs.count(). This then leads to negative doc numbers on merging these segments.

Jed when you say "there are old files (_*.cfs & _*.del) in this directory with updated timestamps that are months old" what do you mean by "with updated timestamps"? Which timestamp is months old and which one is updated?

OK, assuming Jed you are indeed sending "create=false" when creating the Directory and then passing that directory to IndexWriter with create=true, I think we now have this case fully explained (thanks Doron): your old _*.del files are being incorrectly opened & re-used by Lucene, when they should not be.

Lucene (all released versions but not the trunk version, see below) does a simple fileExists("_XXX.del") call to determine if a segment XXX has deletes.

But when that _XXX.del is a leftover from a previous index, it very likely doesn't "match" the newly created _XXX segment. (Especially if merge factor has changed but also if order of operations has changed, which I would expect in this use case).

If that file exists, Lucene assumes it's for this segment and so opens it and uses it.

If it happens that this _XXX.del file has more documents in it than the newly created _XXX.cfs segment, then negative doc numbers will result (and then later cause the "docs out of order" exception).

If it happens that the _XXX.del file has fewer documents than the newly created _XXX.cfs segment then you'll hit ArrayIndexOutOfBounds exceptions in calls to isDeleted(...). If they are exactly equal then you'd randomly see some of your docs got deleted.
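
The arithmetic behind the first two failure cases can be sketched roughly as follows. This is a simplified model and not Lucene's actual code: the class StaleDeletesDemo and its methods are hypothetical, a java.util.BitSet stands in for the contents of a .del file, and the merge is assumed to give each segment a base offset equal to the sum of the live doc counts of the segments before it.

```java
import java.util.BitSet;

// Hypothetical illustration only; none of these names exist in Lucene.
final class StaleDeletesDemo {

    // Live doc count as a reader would report it: maxDoc minus deleted docs.
    static int numDocs(int maxDoc, BitSet deletedDocs) {
        return maxDoc - deletedDocs.cardinality();
    }

    // Simplified merge bookkeeping (an assumption): each segment's base is
    // the sum of the live doc counts of all segments before it.
    static int[] docBases(int[] maxDocs, BitSet[] deletes) {
        int[] bases = new int[maxDocs.length];
        int base = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            bases[i] = base;
            base += numDocs(maxDocs[i], deletes[i]);
        }
        return bases;
    }

    // Deletions lookup backed by an array that was sized for the OLD segment.
    static boolean isDeleted(boolean[] staleDeletes, int doc) {
        return staleDeletes[doc];
    }

    public static void main(String[] args) {
        // Case 1: the new segment holds 5 docs, but the leftover .del file
        // marks 8 docs as deleted.
        BitSet stale = new BitSet();
        stale.set(0, 8);
        System.out.println("numDocs = " + numDocs(5, stale)); // numDocs = -3

        // The next segment's base therefore moves backwards.
        int[] bases = docBases(new int[] {5, 10}, new BitSet[] {stale, new BitSet()});
        System.out.println("base of second segment = " + bases[1]); // -3

        // Case 2: the leftover deletions array covers only 4 docs, but the
        // new segment has 10, so looking up doc 7 runs off the end.
        try {
            isDeleted(new boolean[4], 7);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("isDeleted(7) failed: out of bounds");
        }
    }
}
```

A negative live doc count is what turns into negative (and hence out-of-order) doc numbers once segments are merged; the out-of-bounds lookup corresponds to the ArrayIndexOutOfBounds case.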

Note that the trunk version of Lucene has already fixed this bug (as part of lockless commits):

  * Whether a segment has deletions or not is now explicitly stored in the segments file rather than relying on a "fileExists(...)" call. So, if an old _XXX.del existed in the filesystem, the newly created _XXX segment would not open it.

  * Furthermore, the trunk version of Lucene uses a new IndexFileDeleter class to remove any unreferenced index files. This means it would have removed these old _*.cfs and _*.del files even in the case where a directory was created with "create=false" and the IndexWriter was created with "create=true".

To summarize:

  * There was one case where if you gave slightly illegal doc numbers (within 7 of the actual maxDoc) Lucene may silently accept the call but would corrupt your index, only to be seen later as a "docs out of order" IllegalStateException when the segment is merged. This was just a missing boundary case check. This case is now fixed in the trunk (you get an ArrayIndexOutOfBoundsException if the doc number is too large).

  * There is also another case, that only happens if you have old _*.del files left over from a previous index while re-creating a new index.

    The workaround is simple here: always open the Directory with create=true (or, remove the directory contents yourself beforehand). (IndexWriter does this if you give it a String or File with create=true).

    This is really a bug in Lucene, but given that it's already fixed in the trunk, and the workaround is simple, I'm inclined to not fix it in prior releases and instead publicize the issue (I will do so on java-user).

But, I will commit two additional IllegalStateException checks to the trunk when a segment is first initialized:

  1) check that the two different sources of "maxDoc" (fieldsReader.size() and si.docCount) are the same, and

  2) check that the number of pending deletions does not exceed maxDoc().

When an index has an inconsistency, I think the earlier it's detected the better.
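
The two checks could look roughly like the sketch below. It is not the code committed to the trunk: the class SegmentSanityChecks, the method checkSegment, and its parameters are hypothetical stand-ins for the values named above (fieldsReader.size(), si.docCount, and the number of pending deletions).

```java
// Hypothetical sketch; the class, method, and parameter names are not Lucene's.
final class SegmentSanityChecks {

    // Rejects a segment whose doc counts or deletion count are inconsistent.
    static void checkSegment(String name, int fieldsReaderSize, int docCount, int pendingDeletes) {
        // 1) The two independent sources of maxDoc must agree.
        if (fieldsReaderSize != docCount) {
            throw new IllegalStateException("segment " + name
                + ": doc counts differ: fieldsReader has " + fieldsReaderSize
                + " but segment info has " + docCount);
        }
        // 2) A segment cannot have more deletions than documents.
        if (pendingDeletes > docCount) {
            throw new IllegalStateException("segment " + name
                + ": delete count " + pendingDeletes
                + " exceeds maxDoc " + docCount);
        }
    }

    public static void main(String[] args) {
        // A consistent segment passes silently.
        checkSegment("_a", 10, 10, 3);

        // A segment paired with a stale .del file is rejected up front,
        // instead of failing much later during a merge.
        try {
            checkSegment("_b", 5, 5, 8);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing when the segment is opened points at the inconsistent segment directly, rather than surfacing later as "docs out of order" from deep inside a merge.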

> docs out of order
> -----------------
>
>                 Key: LUCENE-140
>                 URL: https://issues.apache.org/jira/browse/LUCENE-140
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: unspecified
>         Environment: Operating System: Linux
>                      Platform: PC
>            Reporter: legez
>         Assigned To: Michael McCandless
>         Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar,
>                      indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch
>
>
> Hello,
> I can not find out, why (and what) it is happening all the time. I got an
> exception:
> java.lang.IllegalStateException: docs out of order
>         at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
>         at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
>         at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
>         at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
>         at Optimize.main(Optimize.java:29)
> It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not find
> it neither in download nor in version list in this form). Everything seems OK. I
> can search through index, but I can not optimize it. Even worse after this
> exception every time I add new documents and close IndexWriter new segments is
> created! I think it has all documents added before, because of its size.
> My index is quite big: 500.000 docs, about 5gb of index directory.
> It is _repeatable_. I drop index, reindex everything. Afterwards I add a few
> docs, try to optimize and receive above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String data_wydania,
>                           String id_wydania, String id_gazety, String data_wstawienia)
>   {
>     Document doc = new Document();
>     doc.add(Field.Keyword("id", id_strony ));
>     doc.add(Field.Keyword("data_wydania", data_wydania));
>     doc.add(Field.Keyword("id_wydania", id_wydania));
>     doc.add(Field.Text("id_gazety", id_gazety));
>     doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
>     doc.add(Field.Text("tresc", reader));
>     return doc;
>   }
> Sincerely,
> legez