[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991320#comment-12991320 ]
Robert Muir commented on LUCENE-2909:
-------------------------------------

You are right: some stemmers increase the size of the term, so the assumption that end - start == termAtt.length() is a problem.

So, between this and LUCENE-2208, I think we need to add some more checks/asserts to BaseTokenStreamTestCase (at least to validate that no token's offsets run past the end of the input, but maybe some other ideas?). If the highlighter hits this condition, it (rightfully) complains and throws an exception, among other problems. So I think we need to improve this situation everywhere.

> NGramTokenFilter may generate offsets that exceed the length of original text
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2909
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2909
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4
>            Reporter: Shinya Kasatani
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: TokenFilterOffset.patch
>
>
> When using NGramTokenFilter combined with CharFilters that lengthen the
> original text (such as "ß" -> "ss"), the generated offsets exceed the length
> of the original text.
> This causes InvalidTokenOffsetsException when you try to highlight the text
> in Solr.
> While it is not possible to know the accurate offset of each character once
> you tokenize the whole text with tokenizers like KeywordTokenizer,
> NGramTokenFilter should at least avoid generating invalid offsets.
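
For reference, here is a minimal sketch of the kind of offset validation discussed above, together with a pipeline that reproduces the reported condition. The helper name (assertSaneOffsets), the test string, and the exact pipeline construction are illustrative only (this is not the attached patch), and the imports assume the 2.9/3.x-era package layout:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.CharReader;
    import org.apache.lucene.analysis.KeywordTokenizer;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class OffsetCheckSketch {

      // Consumes the stream and fails if any token's offsets fall outside
      // [0, inputLength], where inputLength is the length of the text as it
      // was *before* any CharFilter ran (i.e., what the highlighter sees).
      static void assertSaneOffsets(TokenStream ts, int inputLength) throws IOException {
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          int start = offsetAtt.startOffset();
          int end = offsetAtt.endOffset();
          if (start < 0 || start > end || end > inputLength) {
            throw new AssertionError("invalid offsets: start=" + start
                + " end=" + end + " inputLength=" + inputLength);
          }
        }
        ts.end();
        ts.close();
      }

      public static void main(String[] args) throws IOException {
        String input = "weiß"; // 4 chars before char filtering
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("ß", "ss");    // MappingCharFilter lengthens "weiß" to "weiss"
        TokenStream ts = new NGramTokenFilter(
            new KeywordTokenizer(
                new MappingCharFilter(map, CharReader.get(new StringReader(input)))),
            2, 2);
        // With the reported behavior, bigrams near the end of "weiss" carry
        // end offsets up to 5, past the 4-char original input, so this
        // should trip the check.
        assertSaneOffsets(ts, input.length());
      }
    }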