[
https://issues.apache.org/jira/browse/LUCENE-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151598#comment-14151598
]
Robert Muir commented on LUCENE-5977:
-------------------------------------
The bug is actually older than 4.9: it also existed in the previous indexing
chain.
I ran the test program against 4.8 and, even with AssertingCodec, it just
silently "succeeds".
But if you add a CheckIndex call, it fails:
{noformat}
Segments file=segments_1 numSegments=1 version=4.8 format=
1 of 1: name=_0 docCount=1
codec=Asserting
compound=false
numFiles=12
size (MB)=0.001
diagnostics = {timestamp=1411990327375, os=Linux, os.version=3.13.0-24-generic, source=flush, lucene.version=4.8-SNAPSHOT, os.arch=amd64, java.version=1.7.0_55, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [1 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...ERROR: java.lang.RuntimeException: term [75 73 65]: doc 0: pos 1: startOffset -2147483597 is out of bounds
java.lang.RuntimeException: term [75 73 65]: doc 0: pos 1: startOffset -2147483597 is out of bounds
at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:944)
at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1278)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:626)
at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
test: stored fields.......OK [0 total field count; avg 0 fields per doc]
test: term vectors........ERROR [vector term=[75 73 65] field=field-foo doc=0: startOffset=51 differs from postings startOffset=-2147483597]
java.lang.RuntimeException: vector term=[75 73 65] field=field-foo doc=0: startOffset=51 differs from postings startOffset=-2147483597
at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1748)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:632)
at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]
FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:641)
at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
WARNING: 1 broken segments (containing 1 documents) detected
{noformat}
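The postings failure above is a per-position sanity check on offsets: a negative startOffset is reported as "out of bounds". A minimal standalone sketch of that kind of check follows. It is plain Java with no Lucene dependency; the class name `OffsetSanity`, the method `firstProblem`, and the two extra checks (start going backwards, end before start) are assumptions made for illustration, not the actual CheckIndex code.

```java
/** Hypothetical illustration only; this is not Lucene's CheckIndex implementation. */
final class OffsetSanity {

    /**
     * Walks {startOffset, endOffset} pairs in indexing order for one term in one
     * document and returns a description of the first problem, or null if none.
     */
    static String firstProblem(int[][] offsets) {
        int lastStart = 0;
        for (int pos = 0; pos < offsets.length; pos++) {
            int start = offsets[pos][0];
            int end = offsets[pos][1];
            if (start < 0) {
                // Same wording as the error in the log above.
                return "pos " + pos + ": startOffset " + start + " is out of bounds";
            }
            if (start < lastStart) {
                // Assumed invariant: start offsets never go backwards.
                return "pos " + pos + ": startOffset " + start
                        + " < lastStartOffset " + lastStart;
            }
            if (end < start) {
                // Assumed invariant: a token cannot end before it starts.
                return "pos " + pos + ": endOffset " + end
                        + " < startOffset " + start;
            }
            lastStart = start;
        }
        return null;
    }

    public static void main(String[] args) {
        // Offsets from the issue description: the second value restarts at 50.
        System.out.println(firstProblem(new int[][] {{150, 160}, {50, 60}}));
        // A corrupted value like the one CheckIndex reported.
        System.out.println(firstProblem(new int[][] {{150, 160}, {-2147483597, 60}}));
    }
}
```

Running a check like this at index time, rather than only in CheckIndex, is the kind of safeguard the issue asks for.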
> IW should safeguard against token streams returning invalid offsets for
> multi-valued fields
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-5977
> URL: https://issues.apache.org/jira/browse/LUCENE-5977
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 4.9, 4.9.1, 4.10, 4.10.1
> Reporter: Dawid Weiss
> Priority: Minor
>
> We have a custom token stream that emits information about offsets of each
> token. My (wrong) assumption was that for multi-valued fields a token
> stream's offset information is magically shifted, much as is the case
> with positions. It's not the case -- offsets should be monotonically
> increasing across all instances of a field, even if it has custom token
> streams. So, something like this:
> {code}
> doc.add(new Field("field-foo", new CannedTokenStream(token("bar", 1, 150, 160)), ftype));
> doc.add(new Field("field-foo", new CannedTokenStream(token("bar", 1, 50, 60)), ftype));
> {code}
> where the token function is defined as:
> {code}
> token(String image, int positionIncrement, int startOffset, int endOffset)
> {code}
> will result in either a cryptic assertion thrown from IW:
> {code}
> Exception in thread "main" java.lang.AssertionError
> at org.apache.lucene.index.FreqProxTermsWriterPerField.writeOffsets(FreqProxTermsWriterPerField.java:99)
> {code}
> or nothing (or a codec error) if run without assertions.
> Obviously, returning non-shifted offsets from subsequent token streams makes
> little sense, but I wonder if it could be made more explicit (or asserted)
> that offsets need to be increasing across multiple values. The minimum is to
> add some documentation to OffsetAttribute. I don't know if offsets should be
> shifted automatically, as is the case with positions -- this would change
> the semantics of existing tokenizers and filters which implement such
> shifting internally already.
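For a token stream that reports offsets relative to its own value, one way to meet the requirement in the description is to shift each value's offsets by a running base before they reach the indexer. The sketch below shows only that arithmetic, in plain Java. `OffsetShifter`, `shift`, the `gap` parameter, and the value lengths in `main` are hypothetical and chosen for illustration; this is not how IndexWriter or any Lucene tokenizer is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
final class OffsetShifter {

    /**
     * valueLengths[i] is the character length of the i-th field value;
     * localOffsets[i] holds {startOffset, endOffset} pairs relative to value i.
     * Returns the pairs shifted so that offsets keep increasing across values.
     */
    static List<int[]> shift(int[] valueLengths, int[][][] localOffsets, int gap) {
        List<int[]> shifted = new ArrayList<>();
        int base = 0;
        for (int v = 0; v < valueLengths.length; v++) {
            for (int[] token : localOffsets[v]) {
                shifted.add(new int[] {base + token[0], base + token[1]});
            }
            // The next value starts after this one, plus an assumed gap.
            base += valueLengths[v] + gap;
        }
        return shifted;
    }

    public static void main(String[] args) {
        // The two tokens from the description; the value lengths (200, 100)
        // and the gap of 1 are made up for this example.
        int[] lengths = {200, 100};
        int[][][] local = {{{150, 160}}, {{50, 60}}};
        for (int[] pair : shift(lengths, local, 1)) {
            System.out.println(pair[0] + "-" + pair[1]);
        }
    }
}
```

Whether such shifting belongs in the indexer or in each token stream is the open question raised in the description, since existing tokenizers and filters may already shift offsets themselves.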