[jira] [Commented] (LUCENE-5977) IW should safeguard against token streams returning invalid offsets for multi-valued fields

Robert Muir (JIRA) Mon, 29 Sep 2014 04:36:08 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151598#comment-14151598 ]


Robert Muir commented on LUCENE-5977:
-------------------------------------

The bug is actually older than 4.9: it also existed in the previous indexing
chain.

I ran the test program against 4.8. Even with AssertingCodec, it just silently
"succeeds".

But if you add a CheckIndex call, it reports:
{noformat}
Segments file=segments_1 numSegments=1 version=4.8 format=
  1 of 1: name=_0 docCount=1
    codec=Asserting
    compound=false
    numFiles=12
    size (MB)=0.001
    diagnostics = {timestamp=1411990327375, os=Linux, os.version=3.13.0-24-generic, source=flush, lucene.version=4.8-SNAPSHOT, os.arch=amd64, java.version=1.7.0_55, java.vendor=Oracle Corporation}
    no deletions
    test: open reader.........OK
    test: check integrity.....OK
    test: check live docs.....OK
    test: fields..............OK [1 fields]
    test: field norms.........OK [0 fields]
    test: terms, freq, prox...ERROR: java.lang.RuntimeException: term [75 73 65]: doc 0: pos 1: startOffset -2147483597 is out of bounds
java.lang.RuntimeException: term [75 73 65]: doc 0: pos 1: startOffset -2147483597 is out of bounds
        at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:944)
        at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1278)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:626)
        at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
        at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
        at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
    test: stored fields.......OK [0 total field count; avg 0 fields per doc]
    test: term vectors........ERROR [vector term=[75 73 65] field=field-foo doc=0: startOffset=51 differs from postings startOffset=-2147483597]
java.lang.RuntimeException: vector term=[75 73 65] field=field-foo doc=0: startOffset=51 differs from postings startOffset=-2147483597
        at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1748)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:632)
        at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
        at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
        at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_SET]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:641)
        at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:199)
        at org.apache.lucene.util.TestUtil.checkIndex(TestUtil.java:191)
        at org.apache.lucene.OffsetIndexingBug.main(OffsetIndexingBug.java:48)

WARNING: 1 broken segments (containing 1 documents) detected
{noformat}
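
One arithmetic observation on the two numbers CheckIndex reports above: the postings startOffset (-2147483597) is exactly Integer.MIN_VALUE + 51, i.e. the term vector's startOffset with the sign bit set. The snippet below only verifies that arithmetic in plain Java. The class and method names are made up for illustration, nothing here touches Lucene, and it makes no claim about how the indexing chain arrived at that value.

```java
// Hypothetical standalone check; not part of Lucene or of the test program.
final class OffsetSignBitCheck {
    // Values copied from the CheckIndex output above.
    static final int VECTOR_START = 51;
    static final int POSTINGS_START = -2147483597;

    /** True if both ints agree once the sign bit is masked off. */
    static boolean sameLow31Bits(int a, int b) {
        return (a & Integer.MAX_VALUE) == (b & Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Both lines print "true".
        System.out.println(POSTINGS_START == Integer.MIN_VALUE + VECTOR_START);
        System.out.println(sameLow31Bits(VECTOR_START, POSTINGS_START));
    }
}
```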

> IW should safeguard against token streams returning invalid offsets for 
> multi-valued fields
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5977
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5977
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.9, 4.9.1, 4.10, 4.10.1
>            Reporter: Dawid Weiss
>            Priority: Minor
>
> We have a custom token stream that emits information about offsets of each 
> token. My (wrong) assumption was that for multi-valued fields a token 
> stream's offset information is magically shifted, much as is the case 
> with positions. It's not the case: offsets should be monotonically 
> increasing across all instances of a field, even if it has custom token 
> streams. So, something like this:
> {code}
>         doc.add(new Field("field-foo", new CannedTokenStream(token("bar", 1, 150, 160)), ftype));
>         doc.add(new Field("field-foo", new CannedTokenStream(token("bar", 1,  50,  60)), ftype));
> {code}
> where the token function is defined as:
> {code}
> token(String image, int positionIncrement, int startOffset, int endOffset)
> {code}
> will result in either a cryptic assertion thrown from IW:
> {code}
> Exception in thread "main" java.lang.AssertionError
>       at org.apache.lucene.index.FreqProxTermsWriterPerField.writeOffsets(FreqProxTermsWriterPerField.java:99)
> {code}
> or nothing (or a codec error) if run without assertions.
> Obviously, returning non-shifted offsets from subsequent token streams makes 
> little sense, but I wonder if it could be made more explicit (or asserted) 
> that offsets need to be increasing across multiple values. The minimum is to 
> add some documentation to OffsetAttribute. I don't know if offsets should be 
> shifted automatically, as is the case with positions: this would change 
> the semantics of existing tokenizers and filters that already implement such 
> shifting internally.
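
The constraint described in the issue can be sketched without Lucene by modelling each token as a (startOffset, endOffset) pair. The helper below is hypothetical and not Lucene's API. It assumes the rule is that a start offset must be non-negative, must not go backwards relative to the previous token of the same field, and must not exceed its end offset. It also shows one possible way a caller could shift the offsets of later values itself, using an assumed gap of 1 between values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; tokens are modelled as {startOffset, endOffset} pairs.
final class OffsetShiftSketch {

    /**
     * Returns the index of the first token whose offsets break the assumed
     * rule (start >= 0, start >= previous start, end >= start), or -1.
     */
    static int firstInvalid(int[][] offsets) {
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            if (start < lastStart || end < start) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    /**
     * Flattens the tokens of several field values into one list, shifting
     * each value by the largest end offset of the previous value plus offsetGap.
     */
    static int[][] shiftValues(int[][][] values, int offsetGap) {
        List<int[]> out = new ArrayList<>();
        int base = 0;
        for (int[][] value : values) {
            int maxEnd = 0;
            for (int[] tok : value) {
                out.add(new int[] {base + tok[0], base + tok[1]});
                maxEnd = Math.max(maxEnd, tok[1]);
            }
            base += maxEnd + offsetGap;
        }
        return out.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        // The two values from the issue, fed through unshifted.
        int[][] raw = {{150, 160}, {50, 60}};
        System.out.println(firstInvalid(raw)); // prints 1

        // Same values, shifted per value: the second token becomes {211, 221}.
        int[][] shifted = shiftValues(new int[][][] {{{150, 160}}, {{50, 60}}}, 1);
        System.out.println(firstInvalid(shifted)); // prints -1
    }
}
```

Whether such shifting belongs in the token stream or in the indexer is exactly the open question raised in the issue; the sketch only illustrates the caller-side variant.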



