[ https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133945#comment-15133945 ]
Bjørn Hjelle commented on SOLR-7926:
------------------------------------
OK, I have opened this as a Lucene issue: LUCENE-7016.
Thanks!
> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
> Key: SOLR-7926
> URL: https://issues.apache.org/jira/browse/SOLR-7926
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Affects Versions: 5.1, 5.2.1
> Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
> Reporter: Bjørn Hjelle
> Priority: Critical
> Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting marks the whole word, not just the part that matches the
> search term, when the field type uses EdgeNGramFilterFactory.
> In schema.xml I have the field type text_ngram:
> <fieldType name="text_ngram" class="solr.TextField">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!--tokenizer class="solr.StandardTokenizerFactory"/-->
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>             splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory"
>             maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0"
>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>             splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
> On the Analysis screen in Solr Admin, with index value "lucene" and query
> value "luc", the index analyzer output shows this:
> LENGTF text            luc         luce           lucen             lucene
>        raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]  [6c 75 63 65 6e 65]
>        start           0           0              0                 0
>        end             6           6              6                 6
>        positionLength  1           1              1                 1
>        type            word        word           word              word
>        position        1           1              1                 1
> Since the end offset is 6 for every token, including "luc", the whole word
> ("lucene") is highlighted.
>
> If I switch to NGramFilterFactory, it shows this (first three tokens
> only):
> LENGTF text            luc         uce         cen
>        raw_bytes       [6c 75 63]  [75 63 65]  [63 65 6e]
>        start           0           1           2
>        end             3           4           5
>        positionLength  1           1           1
>        type            word        word        word
>        position        1           1           1
> The end offsets are correct in that case, and the highlighter highlights
> only the search term. Note that I have specified luceneMatchVersion="4.3".
> Without it, the end offset goes back to 6 for NGramFilterFactory as well.
>
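
The offset behaviour described in the quoted report can be reproduced with a
small toy model. The sketch below is not Lucene or Solr code: the Token type,
both gram generators, and the highlight() helper are hypothetical names made
up for illustration, and the highlighter is reduced to "wrap the offsets of
the first token whose text equals the query". It assumes a single-word field
value, so every offset is relative to the start of that word.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # start offset into the original field value
    end: int    # end offset (exclusive) into the original field value


def edge_ngrams_parent_offsets(word: str, min_gram: int, max_gram: int) -> List[Token]:
    """Edge n-grams that all inherit the offsets of the whole source token
    (the first table in the report: start 0, end 6 for every gram)."""
    top = min(max_gram, len(word))
    return [Token(word[:n], 0, len(word)) for n in range(min_gram, top + 1)]


def ngrams_own_offsets(word: str, min_gram: int, max_gram: int) -> List[Token]:
    """N-grams that each carry their own offsets
    (the second table in the report: 0-3, 1-4, 2-5, ...)."""
    top = min(max_gram, len(word))
    return [
        Token(word[i:i + n], i, i + n)
        for n in range(min_gram, top + 1)
        for i in range(len(word) - n + 1)
    ]


def highlight(value: str, tokens: List[Token], query: str) -> str:
    """Wrap the offsets of the first token equal to the query in <em> tags."""
    for tok in tokens:
        if tok.text == query:
            return (value[:tok.start] + "<em>" + value[tok.start:tok.end]
                    + "</em>" + value[tok.end:])
    return value


whole = highlight("lucene", edge_ngrams_parent_offsets("lucene", 3, 20), "luc")
partial = highlight("lucene", ngrams_own_offsets("lucene", 3, 20), "luc")
print(whole)    # <em>lucene</em>
print(partial)  # <em>luc</em>ene
```

In this model the matched text is the same in both cases ("luc"). Only the
stored offsets differ, and they alone decide how much of the original value
ends up between the highlight tags.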