[
https://issues.apache.org/jira/browse/LUCENE-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197383#comment-13197383
]
Robert Muir commented on LUCENE-3738:
-------------------------------------
after investigating: its difficult to prevent negative offsets, even after
fixing term vectors writer (LUCENE-3739)
At first i tried a simple assert in BaseTokenStreamTestCase:
{code}
assertTrue("offsets must not go backwards", offsetAtt.startOffset() >=
lastStartOffset);
lastStartOffset = offsetAtt.startOffset();
{code}
Then these analyzers failed:
* MockCharFilter itself had a bug, but thats easy to fix (LUCENE-3741)
* synonymsfilter failed sometimes (LUCENE-3742) because it wrote zeros for
offsets in situations like "a -> b c"
* (edge)ngramtokenizers failed, because ngrams(1,2) of "ABCD" are not A, AB, B,
BC, C, CD, D but instead A, B, C, D, AB, BC, CD, ...
* (edge)ngramfilters failed for similar reasons.
* worddelimiterfilter failed, because it doesnt break "AB" into A, AB, B but
instead A, B, AB
* trimfilter failed when 'offsets changing' is enabled, because if you have "
rob", "robert" as synonyms then it trims the first, and the second offsets "go
backwards"
These are all bugs.
In general I think offsets after being set should not be changed, because
filters don't have access to any charfilters
offset correction (correctOffset()) anyway, so they shouldnt be mucking offsets.
So really: only the creator of tokens should make the offsets. And if thats a
filter, it should be a standard way,
only inherited from existing offsets and not 'offset mathematics' and not A,
AB, B in some places and A, B, AB in others.
Really i think we need to step it up if we want highlighting to be first-class
citizen in lucene, nothing checks the offsets anyhwere at all,
even to check/assert if they are negative, and there are little tests... all we
have is some newish stuff in basetokenstreamtestcase and
a few trivial test cases.
On the other hand, for example, position increment's impl actually throws
exception if you give it something like a negative number...
> Be consistent about negative vInt/vLong
> ---------------------------------------
>
> Key: LUCENE-3738
> URL: https://issues.apache.org/jira/browse/LUCENE-3738
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Michael McCandless
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3738.patch, LUCENE-3738.patch
>
>
> Today, write/readVInt "allows" a negative int, in that it will encode and
> decode correctly, just horribly inefficiently (5 bytes).
> However, read/writeVLong fails (trips an assert).
> I'd prefer that both vInt/vLong trip an assert if you ever try to write a
> negative number... it's badly trappy today. But, unfortunately, we sometimes
> rely on this... had we had this assert in 'since the beginning' we could have
> avoided that.
> So, if we can't add that assert in today, I think we should at least fix
> readVLong to handle negative longs... but then you quietly spend 9 bytes
> (even more trappy!).
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]