[
https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-3642:
--------------------------------
Attachment: LUCENE-3642_test.patch
here's a test.
The problem is that a previous filter 'lengthens' this term by folding æ -> ae, but
EdgeNGramTokenFilter computes the offsets "additively": offsetAtt.setOffset(tokStart
+ start, tokStart + end).
Because of this, if a word has been 'lengthened' by a previous filter, edgengram
will produce offsets that extend past the end of the original text (and probably
bogus ones if it's been shortened).
I think we should do what WDF (WordDelimiterFilter) does here: if the original
offsets have already been changed (startOffset + termLength != endOffset), then we
should simply preserve them for the new subwords.
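A minimal sketch of that check, pulled out as a standalone method (the method and
variable names here are hypothetical, not the actual EdgeNGramTokenFilter code),
assuming tokStart/tokEnd are the incoming token's offsets and start/end are the
gram's boundaries within the term:

```java
// Hypothetical illustration of the proposed fix; names are made up.
public class OffsetSketch {
    // Compute offsets for a subword [start, end) of a term whose original
    // token spans [tokStart, tokEnd) and whose current text has termLength chars.
    static int[] subwordOffsets(int tokStart, int tokEnd, int termLength,
                                int start, int end) {
        if (tokStart + termLength != tokEnd) {
            // A previous filter changed the term's length (e.g. folding æ -> ae):
            // additive math would be bogus, so preserve the original offsets.
            return new int[] { tokStart, tokEnd };
        }
        // Term text still lines up with the original input: compute additively.
        return new int[] { tokStart + start, tokStart + end };
    }

    public static void main(String[] args) {
        // A 4-char span of original text folded to a 5-char term:
        // 0 + 5 != 4, so the original offsets [0, 4) are preserved.
        int[] folded = subwordOffsets(0, 4, 5, 0, 2);
        System.out.println(folded[0] + "," + folded[1]); // prints "0,4"

        // An unmodified 3-char term: the 2-char edge gram maps additively.
        int[] plain = subwordOffsets(0, 3, 3, 0, 2);
        System.out.println(plain[0] + "," + plain[1]); // prints "0,2"
    }
}
```

Either way the gram's offsets never point outside the span the original token
covered in the input text, which is what the highlighter depends on.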
I added a check for this to BaseTokenStreamTestCase... now to see if anything
else fails...
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
> Key: LUCENE-3642
> URL: https://issues.apache.org/jira/browse/LUCENE-3642
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.5
> Reporter: Robert Muir
> Attachments: 6B2Uh.png, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an
> error.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]