[ 
https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Attachment: LUCENE-3642_test.patch

here's a test.

the problem is that a previous filter 'lengthens' this term by folding æ -> ae, 
but EdgeNGramFilter computes the offsets "additively": 
offsetAtt.setOffset(tokStart + start, tokStart + end);

Because of this, if a word has been 'lengthened' by a previous filter, edgengram 
will produce offsets that extend past the original text (and probably bogus 
ones if it's been shortened).
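Here's a minimal, self-contained sketch (not actual Lucene code; names are 
made up for illustration) of why the additive offset math overshoots once an 
upstream filter has changed the term's length:

```java
// Sketch of EdgeNGram's "additive" offset computation:
// gram end offset = original token start offset + gram end index in the term.
public class AdditiveOffsets {
    static int additiveEndOffset(int tokStart, int gramEnd) {
        return tokStart + gramEnd;
    }

    public static void main(String[] args) {
        // Original text "æ" covers offsets [0, 1); an ASCII-folding filter
        // rewrites the term text to "ae" (length 2) but leaves offsets alone.
        int tokStart = 0, tokEnd = 1;
        int end = additiveEndOffset(tokStart, 2); // end of the 2-gram "ae"
        System.out.println(end);                  // 2
        System.out.println(end > tokEnd);         // true: past the original text
    }
}
```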

I think we should do what WDF does here: if the original offsets have already 
been changed (startOffset + termLength != endOffset), then we should simply 
preserve them for the new subwords.
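A sketch of that guard, with hypothetical helper names (the real patch would 
live inside the filter's incrementToken()):

```java
// WordDelimiterFilter-style guard: if the incoming token's offsets no longer
// match its term length, the text was changed upstream, so keep the original
// offsets for every gram instead of computing them additively.
public class PreserveOffsets {
    static int[] gramOffsets(int startOff, int endOff, int termLen,
                             int gramStart, int gramEnd) {
        if (startOff + termLen != endOff) {
            // term was lengthened/shortened upstream: additive offsets would
            // be bogus, so preserve the original token's offsets unchanged
            return new int[] { startOff, endOff };
        }
        // offsets still line up with the text: additive computation is safe
        return new int[] { startOff + gramStart, startOff + gramEnd };
    }

    public static void main(String[] args) {
        // "æ" folded to "ae": startOff=0, endOff=1, termLen=2 -> preserve [0,1]
        int[] changed = gramOffsets(0, 1, 2, 0, 2);
        // untouched "ab": startOff=0, endOff=2, termLen=2 -> additive [0,2]
        int[] intact = gramOffsets(0, 2, 2, 0, 2);
        System.out.println(changed[0] + "," + changed[1]); // 0,1
        System.out.println(intact[0] + "," + intact[1]);   // 0,2
    }
}
```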

I added a check for this to BaseTokenStreamTestCase... now to see if anything 
else fails... 
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an 
> error.

