[ https://issues.apache.org/jira/browse/LUCENE-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948713#comment-15948713 ]
Adrien Grand commented on LUCENE-7758:
--------------------------------------

bq. This behaviour is irrational

Well, this is not exactly true. This is a token filter, meaning it can be applied on top of any set of other token filters. Now imagine that someone is applying edge n-grams on top of synonyms: this could generate broken offsets (going backwards, for instance), so keeping the original offsets is the only safe move. A workaround for this issue is to use the (edge) n-gram tokenizers (as opposed to filters), which also have a protected {{isTokenChar}} method that can be overridden if you want to e.g. split on whitespace.

> EdgeNGramTokenFilter breaks highlighting by keeping end offsets of original
> tokens
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7758
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7758
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.4.1
>         Environment: elasticsearch-5.3
>            Reporter: Mikhail Bystryantsev
>              Labels: EdgeNGramTokenFilter, highlighting
>
> When EdgeNGramTokenFilter produces new tokens, they inherit the end offsets
> of their parent tokens. This behaviour is irrational and breaks highlighting:
> not the matched pattern but the whole source token gets highlighted.
> It seems a similar problem was fixed in LUCENE-3642, but end offsets were
> broken again after LUCENE-3907.
> Some discussion can be found in SOLR-7926:
> {quote}I agree this (highlighting of hits from tokens produced by
> EdgeNGramFilter) got worse with LUCENE-3907, but it's not clear how to
> fix it.
> The stacking seems more correct: all these grams are logically
> interchangeable with the original token, and were derived from it, so
> e.g. a phrase query involving them with adjacent tokens would work
> correctly.
> We could perhaps remove the token graph requirement that tokens
> leaving the same node have the same startOffset, and tokens arriving at
> the same node have the same endOffset. Lucene would still be able to
> index such a graph, as long as all tokens leaving a given node are
> sorted according to their startOffset. But I'm not sure whether there
> would be other problems...
> Or we could maybe improve the token graph, at least for the non-edge
> NGramTokenFilter, so that the grams are linked up correctly and any
> path through the graph reconstructs the original characters.
> But realistically it's not possible to innovate much with token graphs
> in Lucene today because of apparently severe back-compat requirements:
> e.g. LUCENE-6664, which fixes the token graph bugs in the existing
> SynonymFilter so that proximity queries work correctly when using
> search-time synonyms, is blocked because of the back-compat concerns
> from LUCENE-6721.
> I'm not sure what the path forward is...{quote}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
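The offset behavior under discussion can be illustrated without Lucene itself. The sketch below is a plain-Java illustration (class and method names are mine, not Lucene API): `filterStyle` mimics what the issue complains about in EdgeNGramTokenFilter, where every gram inherits the original token's end offset, while `tokenizerStyle` mimics the per-gram offsets that a tokenizer can safely produce and that highlighters need.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- no Lucene dependency, names are hypothetical.
public class EdgeNGramOffsets {

    // A gram together with its character offsets into the source text.
    record Gram(String text, int start, int end) {}

    // Per-gram end offsets: each gram's end offset matches the characters
    // it actually covers (tokenizer-like behavior, highlighter-friendly).
    static List<Gram> tokenizerStyle(String token, int offset, int minGram, int maxGram) {
        List<Gram> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(new Gram(token.substring(0, len), offset, offset + len));
        }
        return grams;
    }

    // Inherited end offsets: every gram keeps the parent token's end offset
    // (the EdgeNGramTokenFilter behavior this issue reports), so a hit on
    // a short gram highlights the whole source token.
    static List<Gram> filterStyle(String token, int offset, int minGram, int maxGram) {
        List<Gram> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(new Gram(token.substring(0, len), offset, offset + token.length()));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "lucene" starts at offset 0 in the source text.
        System.out.println(tokenizerStyle("lucene", 0, 1, 3));
        System.out.println(filterStyle("lucene", 0, 1, 3));
    }
}
```

The workaround Adrien mentions sidesteps this by doing the splitting and the gram generation in one tokenizer: subclassing the edge n-gram tokenizer and overriding its protected {{isTokenChar}} (e.g. to return false for whitespace) keeps per-gram offsets while still breaking the input on the characters you choose.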