[ https://issues.apache.org/jira/browse/LUCENE-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949039#comment-15949039 ]

Uwe Schindler commented on LUCENE-7758:
---------------------------------------

bq. Moreover, I would not be surprised that highlighting the entire token is a 
desired behaviour for some users.

This is correct. Modifying offsets inside a TokenFilter is not going to be 
correct for highlighting, for the reasons you mention. This is a general 
issue with all token filters that split tokens: the "famous" example is 
WordDelimiterFilter.

Assigning offsets is the responsibility of tokenizers. TokenFilters should just 
look at tokens and modify them, but not split them or change their offsets.
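
To make this concrete, here is a minimal sketch of what the filter emits on the 
affected versions (assuming the Lucene 6.x analysis API; the gram sizes 2/3 and 
the demo class name are only for illustration):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class EdgeNGramOffsetDemo {
  public static void main(String[] args) throws Exception {
    // Emit the single token "lucene" (offsets [0,6]), then split it into
    // edge n-grams. minGram=2 and maxGram=3 are arbitrary demo values.
    KeywordTokenizer tokenizer = new KeywordTokenizer();
    tokenizer.setReader(new StringReader("lucene"));
    TokenStream ts = new EdgeNGramTokenFilter(tokenizer, 2, 3);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);

    ts.reset();
    while (ts.incrementToken()) {
      // On the affected versions every gram reports the original token's
      // offsets ([0,6] here), which is why a highlighter marks the whole word.
      System.out.printf("term=%s posInc=%d offsets=[%d,%d]%n",
          term, posInc.getPositionIncrement(), offset.startOffset(), offset.endOffset());
    }
    ts.end();
    ts.close();
  }
}
{code}

The grams are stacked at the same position (position increment 0 after the 
first gram), which is why phrase queries over them still work; the highlighting 
problem comes only from the unchanged offsets.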

In addition, highlighting is not meant to produce "exact" explanations of every 
analysis step. It is meant more to allow highlighting whole tokens afterwards, 
so the user has an idea of which token was responsible for a hit.

> EdgeNGramTokenFilter breaks highlighting by keeping end offsets of original 
> tokens
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7758
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7758
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.4.1
>         Environment: elasticsearch-5.3
>            Reporter: Mikhail Bystryantsev
>              Labels: EdgeNGramTokenFilter, highlighting
>
> When EdgeNGramTokenFilter produces new tokens, they inherit the end offsets 
> of their parent tokens. This behaviour is irrational and breaks highlighting: 
> not the matched pattern but the whole source token gets highlighted.
> It seems a similar problem was fixed in LUCENE-3642, but end offsets were 
> broken again after LUCENE-3907.
> Some related discussion can be found in SOLR-7926:
> {quote}I agree this (highlighting of hits from tokens produced by
> EdgeNGramFilter) got worse with LUCENE-3907, but it's not clear how to
> fix it.
> The stacking seems more correct: all these grams are logically
> interchangeable with the original token, and were derived from it, so
> e.g. a phrase query involving them with adjacent tokens would work
> correctly.
> We could perhaps remove the token graph requirement that tokens
> leaving from the same node have the same startOffset, and arriving at
> the same node have the same endOffset. Lucene would still be able to
> index such a graph, as long as all tokens leaving a given node are
> sorted according to their startOffset. But I'm not sure if there
> would be other problems...
> Or we could maybe improve the token graph, at least for the non-edge
> NGramTokenFilter, so that the grams are linked up correctly, so that any
> path through the graph reconstructs the original characters.
> But realistically it's not possible to innovate much with token graphs
> in Lucene today because of apparently severe back compat requirements:
> e.g. LUCENE-6664, which fixes the token graph bugs in the existing
> SynonymFilter so that proximity queries work correctly when using
> search-time synonyms, is blocked because of the back compat concerns
> from LUCENE-6721.
> I'm not sure what the path forward is...{quote}


