[ 
https://issues.apache.org/jira/browse/LUCENE-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909686#action_12909686
 ] 

Simon Willnauer commented on LUCENE-2622:
-----------------------------------------

It seems we have figured out what is going on here. The problem appears to be
the optimization done in LUCENE-2588, where we strip off the non-distinguishing
suffix to save RAM in the loaded terms index. That optimization is not safe
for all comparators. The test case runs with a reverse unicode comparator,
which causes terms to appear in reverse order during indexing.
This is not a problem until we hit a situation where the stripped suffix is
required because of the nature of the comparator. In this case we index the
numbers from 0 to 173, and with the randomly chosen termIndexInterval of 54
the indexing code gets the prefix wrong. It sees the term "49" with prior term
"5", concludes that it can strip off the "9", and uses "4" as the indexed term.
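
To illustrate the stripping step described above, here is a minimal sketch. It
is not the LUCENE-2588 code: the class name and the stripSuffix helper are
hypothetical, and it works on Java strings rather than on the term bytes the
real index uses (for ASCII digits the ordering is the same). It keeps the
shortest prefix of the index term that differs from the prior term and then
checks where that prefix sorts under a reverse comparator.

```java
import java.util.Comparator;

public class SuffixStripSketch {
    // Hypothetical helper: shortest prefix of `current` that differs from `prior`.
    static String stripSuffix(String prior, String current) {
        int limit = Math.min(prior.length(), current.length());
        int i = 0;
        while (i < limit && prior.charAt(i) == current.charAt(i)) {
            i++;
        }
        // Keep one character past the shared prefix, capped at the full term.
        return current.substring(0, Math.min(i + 1, current.length()));
    }

    public static void main(String[] args) {
        // Stand-in for the reverse unicode comparator used by the test.
        Comparator<String> reverse = Comparator.reverseOrder();

        String indexed = stripSuffix("5", "49");
        System.out.println(indexed); // prints 4

        // The full term "49" sorts before "44" in reverse order ...
        System.out.println(Integer.signum(reverse.compare("49", "44"))); // prints -1
        // ... but the stripped term "4" sorts after "44".
        System.out.println(Integer.signum(reverse.compare(indexed, "44"))); // prints 1
    }
}
```

With a forward comparator the stripped prefix still sorts between the prior
term and the full term, which is why the shortcut only breaks once the order is
reversed.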

When we later seek on the terms dictionary, the binary search in
CoreFieldIndex#getIndexOffset tries to find the indexed term prior to the term
"44". It compares against "4", which returns -1, whereas comparing against
"49" would have yielded 1. We therefore end up with the wrong offset and the
assert blows up.
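
The effect on the lookup can be sketched as follows. This is a simplified
stand-in for the binary search, not the actual CoreFieldIndex#getIndexOffset
code; the class name, the indexOffset helper, and the three-entry index lists
are made up for illustration. Only the "49" versus "4" entry comes from the
failing test.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class IndexOffsetSketch {
    // Returns the position of the last index term that sorts at or before
    // `target` under `cmp`, or -1 if every index term sorts after it.
    static int indexOffset(List<String> indexTerms, String target,
                           Comparator<String> cmp) {
        int lo = 0;
        int hi = indexTerms.size() - 1;
        while (hi >= lo) {
            int mid = (lo + hi) >>> 1;
            int delta = cmp.compare(target, indexTerms.get(mid));
            if (delta < 0) {
                hi = mid - 1;
            } else if (delta > 0) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        // Stand-in for the reverse unicode comparator used by the test.
        Comparator<String> reverse = Comparator.reverseOrder();

        // Hypothetical index terms, listed in reverse order.
        List<String> fullTerms = Arrays.asList("99", "49", "10");
        List<String> strippedTerms = Arrays.asList("99", "4", "10");

        // With the full term, "44" lands in the block that starts at "49".
        System.out.println(indexOffset(fullTerms, "44", reverse)); // prints 1
        // With the stripped term, the search falls back to the previous block.
        System.out.println(indexOffset(strippedTerms, "44", reverse)); // prints 0
    }
}
```

In the second call the comparison of "44" against "4" is negative, so the
search moves left and picks the block before the one that actually holds "44".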

To fix that, we somehow need access to the comparator actually in use while
building the indexed terms. I will reopen LUCENE-2588.

> Random Test Failure org.apache.lucene.TestExternalCodecs.testPerFieldCodec 
> (from TestExternalCodecs)
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2622
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Tests
>            Reporter: Mark Miller
>            Priority: Minor
>
> Error Message
> state.ord=54 startOrd=0 ir.isIndexTerm=true state.docFreq=1
> Stacktrace
> junit.framework.AssertionFailedError: state.ord=54 startOrd=0 ir.isIndexTerm=true state.docFreq=1
>       at org.apache.lucene.index.codecs.standard.StandardTermsDictReader$FieldReader$SegmentTermsEnum.seek(StandardTermsDictReader.java:395)
>       at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1099)
>       at org.apache.lucene.index.DocumentsWriter.applyDeletes(DocumentsWriter.java:1028)
>       at org.apache.lucene.index.IndexWriter.applyDeletes(IndexWriter.java:4213)
>       at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3381)
>       at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3221)
>       at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3211)
>       at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2345)
>       at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2323)
>       at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2293)
>       at org.apache.lucene.TestExternalCodecs.testPerFieldCodec(TestExternalCodecs.java:645)
>       at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:381)
>       at org.apache.lucene.util.LuceneTestCase.run(LuceneTestCase.java:373)
> Standard Output
> NOTE: random codec of testcase 'testPerFieldCodec' was: MockFixedIntBlock(blockSize=1327)
> NOTE: random locale of testcase 'testPerFieldCodec' was: lt_LT
> NOTE: random timezone of testcase 'testPerFieldCodec' was: Africa/Lusaka
> NOTE: random seed of testcase 'testPerFieldCodec' was: 812019387131615618
