[ https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697791#comment-14697791 ]
David Smiley commented on SOLR-7926:
------------------------------------
bq. We could perhaps remove the token graph requirement that tokens leaving
from the same node have the same startOffset, and arriving to the same node
have the same endOffset. Lucene would still be able to index such a graph, as
long as all tokens leaving a given node are sorted according to their
startOffset. But I'm not sure if there would be other problems...
That direction seems best to me. But yes, there may be problems we don't see
yet that will only show up once we try. Not a blocker; just some unknown
unknowns that will become known :-)
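
To make the difference between the current rule and the relaxed one concrete, here is a minimal sketch in plain Python. It is not Lucene code: the `Token` tuple and the functions `strict_ok` and `relaxed_ok` are hypothetical names invented for illustration, and the checks only model the two rules as they are worded in the quote above.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    from_node: int  # graph node (position) the token leaves from
    to_node: int    # graph node the token arrives at
    start: int      # start offset in the original text
    end: int        # end offset in the original text


def strict_ok(tokens):
    """Current rule: same from_node => same start, same to_node => same end."""
    starts, ends = {}, {}
    for t in tokens:
        if starts.setdefault(t.from_node, t.start) != t.start:
            return False
        if ends.setdefault(t.to_node, t.end) != t.end:
            return False
    return True


def relaxed_ok(tokens):
    """Relaxed rule: tokens leaving a node only need to be sorted by start."""
    last_start = {}
    for t in tokens:
        if t.start < last_start.get(t.from_node, 0):
            return False
        last_start[t.from_node] = t.start
    return True


# Edge n-grams of "lucene", all stacked between node 0 and node 1.
whole_word = [Token(g, 0, 1, 0, 6) for g in ("luc", "luce", "lucen", "lucene")]
precise = [Token(g, 0, 1, 0, len(g)) for g in ("luc", "luce", "lucen", "lucene")]

print(strict_ok(whole_word), relaxed_ok(whole_word))  # True True
print(strict_ok(precise), relaxed_ok(precise))        # False True
```

In this model, per-gram end offsets fail the current rule because tokens arriving at the same node have different end offsets, while the relaxed rule accepts them since their start offsets are all 0.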
> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
> Key: SOLR-7926
> URL: https://issues.apache.org/jira/browse/SOLR-7926
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Affects Versions: 5.1, 5.2.1
> Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
> Reporter: Bjørn Hjelle
> Priority: Critical
> Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting highlights the whole word, not just the part that matches the
> search term, when using EdgeNGramFilterFactory in the field type.
> In schema.xml I have the field type text_ngram:
> <fieldType name="text_ngram" class="solr.TextField">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!--tokenizer class="solr.StandardTokenizerFactory"/-->
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>             splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory"
>             maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory"
>                 mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="0" generateNumberParts="0"
>             catenateWords="0" catenateNumbers="0" catenateAll="0"
>             splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>             pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
> In the Solr Admin Analysis screen, with index value "lucene" and query value
> "luc", it shows this:
> LENGTF  text            luc         luce           lucen             lucene
>         raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]  [6c 75 63 65 6e 65]
>         start           0           0              0                 0
>         end             6           6              6                 6
>         positionLength  1           1              1                 1
>         type            word        word           word              word
>         position        1           1              1                 1
> Since the end offset is 6 in every case, the whole word ("lucene") is
> highlighted.
>
> If I change to use NGramFilterFactory it shows me this (for the first three
> items):
> LENGTF  text            luc         uce         cen
>         raw_bytes       [6c 75 63]  [75 63 65]  [63 65 6e]
>         start           0           1           2
>         end             3           4           5
>         positionLength  1           1           1
>         type            word        word        word
>         position        1           1           1
> The end offsets are correct then and the highlighter highlights only the
> search term. Note that I have specified luceneMatchVersion="4.3". Without
> this, the end offsets go back to 6 for the NGramFilterFactory as well.
>
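
The effect described in the report can be reproduced with a small stand-alone Python sketch. It assumes, as a simplification, that the highlighter simply wraps the text between the matching token's start and end offsets. `edge_ngrams` and `highlight` are hypothetical helpers written for this illustration, not Solr or Lucene APIs.

```python
def edge_ngrams(word, min_gram, max_gram, precise_offsets):
    """Return (text, start, end) for each leading n-gram of `word`."""
    grams = []
    for n in range(min_gram, min(max_gram, len(word)) + 1):
        # Reported behaviour: every gram keeps the offsets of the whole word.
        # Expected behaviour: the end offset follows the gram length.
        end = n if precise_offsets else len(word)
        grams.append((word[:n], 0, end))
    return grams


def highlight(text, query, grams, pre="<em>", post="</em>"):
    """Wrap the span given by the offsets of the gram equal to `query`."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text  # no gram matched, nothing to highlight


whole_word = highlight("lucene", "luc", edge_ngrams("lucene", 3, 20, False))
prefix_only = highlight("lucene", "luc", edge_ngrams("lucene", 3, 20, True))
print(whole_word)   # <em>lucene</em>
print(prefix_only)  # <em>luc</em>ene
```

With whole-word offsets (end = 6 for every gram) the match on "luc" marks all of "lucene". With per-gram end offsets only the matched prefix is marked, which is the behaviour the report asks for.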