[ https://issues.apache.org/jira/browse/LUCENE-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948713#comment-15948713 ]
Adrien Grand commented on LUCENE-7758:
--------------------------------------

bq. This behaviour is irrational

Well, this is not exactly true. This is a token filter, meaning it can be applied on top of any set of other token filters. Now imagine that someone is applying edge n-grams on top of synonyms: this could generate broken offsets (going backwards, for instance), so keeping the original offsets is the only safe move. A workaround for this issue is to use the (edge) n-gram tokenizers (as opposed to filters), which also have a protected {{isTokenChar}} method that can be overridden if you want to e.g. split on whitespace.

> EdgeNGramTokenFilter breaks highlighting by keeping end offsets of original
> tokens
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-7758
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7758
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.4.1
>         Environment: elasticsearch-5.3
>            Reporter: Mikhail Bystryantsev
>              Labels: EdgeNGramTokenFilter, highlighting
>
> When EdgeNGramTokenFilter produces new tokens, they inherit the end offsets
> of their parent tokens. This behaviour is irrational and breaks highlighting:
> not the matched pattern but the whole source token gets highlighted.
> It seems a similar problem was fixed in LUCENE-3642, but end offsets were
> broken again after LUCENE-3907.
> Some discussion can be found in SOLR-7926:
> {quote}I agree this (highlighting of hits from tokens produced by
> EdgeNGramFilter) got worse with LUCENE-3907, but it's not clear how to
> fix it.
> The stacking seems more correct: all these grams are logically
> interchangeable with the original token, and were derived from it, so
> e.g. a phrase query involving them with adjacent tokens would work
> correctly.
> We could perhaps remove the token graph requirement that tokens
> leaving the same node have the same startOffset, and tokens arriving at
> the same node have the same endOffset. Lucene would still be able to
> index such a graph, as long as all tokens leaving a given node are
> sorted according to their startOffset. But I'm not sure whether there
> would be other problems...
> Or we could maybe improve the token graph, at least for the non-edge
> NGramTokenFilter, so that the grams are linked up correctly and any
> path through the graph reconstructs the original characters.
> But realistically it's not possible to innovate much with token graphs
> in Lucene today because of apparently severe back-compat requirements:
> e.g. LUCENE-6664, which fixes the token graph bugs in the existing
> SynonymFilter so that proximity queries work correctly when using
> search-time synonyms, is blocked because of the back-compat concerns
> from LUCENE-6721.
> I'm not sure what the path forward is...{quote}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
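The offset behavior under discussion can be illustrated without Lucene itself. The sketch below is a plain-Java illustration (class and method names are mine, not Lucene API): `filterStyle` mimics what the issue complains about in EdgeNGramTokenFilter, where every gram inherits the original token's end offset, while `tokenizerStyle` mimics the per-gram offsets that a tokenizer can safely produce and that highlighters need.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- no Lucene dependency, names are hypothetical.
public class EdgeNGramOffsets {

    // A gram together with its character offsets into the source text.
    record Gram(String text, int start, int end) {}

    // Per-gram end offsets: each gram's end offset matches the characters
    // it actually covers (tokenizer-like behavior, highlighter-friendly).
    static List<Gram> tokenizerStyle(String token, int offset, int minGram, int maxGram) {
        List<Gram> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(new Gram(token.substring(0, len), offset, offset + len));
        }
        return grams;
    }

    // Inherited end offsets: every gram keeps the parent token's end offset
    // (the EdgeNGramTokenFilter behavior this issue reports), so a hit on
    // a short gram highlights the whole source token.
    static List<Gram> filterStyle(String token, int offset, int minGram, int maxGram) {
        List<Gram> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(new Gram(token.substring(0, len), offset, offset + token.length()));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "lucene" starts at offset 0 in the source text.
        System.out.println(tokenizerStyle("lucene", 0, 1, 3));
        System.out.println(filterStyle("lucene", 0, 1, 3));
    }
}
```

The workaround Adrien mentions sidesteps this by doing the splitting and the gram generation in one tokenizer: subclassing the edge n-gram tokenizer and overriding its protected {{isTokenChar}} (e.g. to return false for whitespace) keeps per-gram offsets while still breaking the input on the characters you choose.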