[
https://issues.apache.org/jira/browse/LUCENE-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949095#comment-15949095
]
Mikhail Bystryantsev commented on LUCENE-7758:
----------------------------------------------
{quote}Assigning offsets is the responsibility of tokenizers. Tokenfilters
should just look at tokens and modify them, but not split them or change their
offsets.{quote}
But there can be *only one* tokenizer, so there is no way to get tokens other
than those produced by that single tokenizer. There is no way to customize
without writing your own tokenizer. It is possible to combine token filters,
but not tokenizers.
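The one-tokenizer constraint can be sketched like this (illustrative Python, not
Lucene code; the function names are hypothetical):

```python
# Sketch of an analysis chain: exactly one tokenizer produces the tokens
# (and their offsets), and any number of token filters transform them.
# Filters can be stacked freely, but the tokenizer slot cannot.

def whitespace_tokenizer(text):
    """Split on whitespace, recording (term, start_offset, end_offset)."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append((word, start, start + len(word)))
        pos = start + len(word)
    return tokens

def lowercase_filter(tokens):
    """A token filter: modifies terms but leaves offsets untouched."""
    return [(term.lower(), start, end) for (term, start, end) in tokens]

def analyze(text, tokenizer, filters):
    """One tokenizer feeds a stack of filters, as in a Lucene Analyzer."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("The Quick Fox", whitespace_tokenizer, [lowercase_filter]))
# [('the', 0, 3), ('quick', 4, 9), ('fox', 10, 13)]
```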
{quote}In addition, highlighting is not meant to produce "exact" explanations
of every analysis step. It is more meant to allow highlighting whole tokens
afterwards, so the user has an idea, which token was responsible for a
hit.{quote}
I think this should be decided by Lucene users, not by anyone else. When you
design your index and search behaviour, only you can decide how it should work
based on your project requirements.
> EdgeNGramTokenFilter breaks highlighting by keeping end offsets of original
> tokens
> ----------------------------------------------------------------------------------
>
> Key: LUCENE-7758
> URL: https://issues.apache.org/jira/browse/LUCENE-7758
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.4.1
> Environment: elasticsearch-5.3
> Reporter: Mikhail Bystryantsev
> Labels: EdgeNGramTokenFilter, highlighting
>
> When EdgeNGramTokenFilter produces new tokens, they inherit the end offsets
> of their parent tokens. This behaviour is irrational and breaks highlighting:
> the whole source token is highlighted rather than the matched pattern.
> It seems a similar problem was fixed in LUCENE-3642, but end offsets were
> broken again after LUCENE-3907.
> Some discussion was found in SOLR-7926:
> {quote}I agree this (highlighting of hits from tokens produced by
> EdgeNGramFilter) got worse with LUCENE-3907, but it's not clear how to
> fix it.
> The stacking seems more correct: all these grams are logically
> interchangeable with the original token, and were derived from it, so
> e.g. a phrase query involving them with adjacent tokens would work
> correctly.
> We could perhaps remove the token graph requirement that tokens
> leaving from the same node have the same startOffset, and arriving to
> the same node have the same endOffset. Lucene would still be able to
> index such a graph, as long as all tokens leaving a given node are
> sorted according to their startOffset. But I'm not sure if there
> would be other problems...
> Or we could maybe improve the token graph, at least for the non-edge
> NGramTokenFilter, so that the grams are linked up correctly, so that any
> path through the graph reconstructs the original characters.
> But realistically it's not possible to innovate much with token graphs
> in Lucene today because of apparently severe back compat requirements:
> e.g. LUCENE-6664, which fixes the token graph bugs in the existing
> SynonymFilter so that proximity queries work correctly when using
> search-time synonyms, is blocked because of the back compat concerns
> from LUCENE-6721.
> I'm not sure what the path forward is...{quote}
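The offset behaviour under discussion can be sketched as follows (illustrative
Python, not Lucene code; `edge_ngrams_current` mimics the reported behaviour,
`edge_ngrams_per_gram` shows the hypothetical per-gram alternative):

```python
# EdgeNGramTokenFilter-style gram generation over a single token.
# In the reported behaviour, every gram keeps the parent token's offsets,
# so a hit on a short gram highlights the whole source token.

def edge_ngrams_current(token, start, end, min_gram=1, max_gram=3):
    """All grams keep the parent token's (start, end) offsets."""
    return [(token[:n], start, end)
            for n in range(min_gram, min(max_gram, len(token)) + 1)]

def edge_ngrams_per_gram(token, start, end, min_gram=1, max_gram=3):
    """Hypothetical alternative: each gram's end offset covers only itself."""
    return [(token[:n], start, start + n)
            for n in range(min_gram, min(max_gram, len(token)) + 1)]

# "quick" occupies offsets [0, 5) in the source text.
print(edge_ngrams_current("quick", 0, 5))
# [('q', 0, 5), ('qu', 0, 5), ('qui', 0, 5)]  -> whole token highlighted
print(edge_ngrams_per_gram("quick", 0, 5))
# [('q', 0, 1), ('qu', 0, 2), ('qui', 0, 3)]  -> only the gram highlighted
```

Note that the per-gram variant is exactly what the stacking argument above
pushes against: grams sharing the original token's offsets keep them logically
interchangeable with it, at the cost of coarser highlighting.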
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]